Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, Cewu Lu

## Problem Definition

- $\mathcal{D}_s = \{I_i, S_i, K_i\}^N_{i=1}$ dataset
- $N$ labeled training examples
- $I_i \in \mathbb{R}^{h \times w \times 3}$ input image
- $S_i \in \mathbb{R}^{h \times w \times u}$ part segmentation label
- $K_i \in \mathbb{R}^{v \times 2}$ keypoints annotation

### Standard Semantic Part Segmentation

$\mathcal{E}(\Phi) = \sum_i \sum_j e\left[ f_p^j(I_i;\Phi), S_i(j) \right]$

- $f^j_p(I_i;\Phi)$ is the label predicted at pixel $j$ by the network with parameters $\Phi$
- $e[\cdot]$ is the per-pixel loss function
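The per-pixel loss can be sketched as follows; choosing softmax cross-entropy as $e[\cdot]$ and the $(h, w, u)$ logit layout are assumptions for illustration, not details fixed by the paper:

```python
import numpy as np

def per_pixel_loss(logits, labels):
    """Sum of per-pixel losses e[f_p^j(I; Phi), S(j)] over pixels j.

    logits: (h, w, u) unnormalized class scores, one vector per pixel.
    labels: (h, w) integer part labels S(j).
    The cross-entropy choice of e[.] is an assumption for illustration.
    """
    h, w, u = logits.shape
    # numerically stable softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    # probability assigned to the true label at each pixel
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(p_true).sum()
```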

### Another Dataset without Segmentation Labels

$\mathcal{D}_p = \{I_i, K_i\}^M_{i=1}$ is a dataset of $M$ examples with only keypoint annotations, where $M \gg N$

## Proposed

### Motivation

Pixel-level semantic labeling of body parts is labor-intensive.

The **small amount of labeled data** may lead to overfitting and degrade performance in real-world scenarios.

### Overview

Transfer part segmentation annotations from labeled samples to unlabeled data based on pose similarity, generating extra training samples; this is what makes the approach semi-supervised.

- Semi-supervised method
- Using $\mathcal{D}_p$ to train by morphing segmentations of pose-similar samples from $\mathcal{D}_s$

### Steps

- Given $(I_t, K_t) \in \mathcal{D}_p$, find a cluster of pose-similar samples in $\mathcal{D}_s$
- Generate a part-label prior $P_t$ from the segmentations of the clustered samples
- Refine $P_t$ into $\hat{S}_t$ (requires training a refinement network)
- Use $\hat{S}_t$ together with $S_i$ from $\mathcal{D}_s$ to train the parsing network (the parsing network takes an image as input and outputs a segmentation)

### Cluster

- Compute Euclidean distances between $K_t$ and every keypoint annotation in $\mathcal{D}_s$
- Select the top-$k$ nearest samples as the pose-similar cluster
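A minimal NumPy sketch of the clustering step; the flat Euclidean distance over all $v$ keypoints is an assumption — the paper may normalize poses first:

```python
import numpy as np

def pose_similar_cluster(K_t, keypoints_s, k=3):
    """Indices of the k samples in D_s whose keypoint annotations are
    closest (Euclidean distance) to the target pose K_t.

    K_t: (v, 2) target keypoints; keypoints_s: (M, v, 2) annotations
    from D_s.
    """
    diffs = keypoints_s - K_t[None]                 # (M, v, 2)
    dists = np.sqrt((diffs ** 2).sum(axis=(1, 2)))  # one distance per sample
    return np.argsort(dists)[:k]                    # top-k nearest
```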

### Part-label prior

#### Morphing

Denote by $\mathbb{S} = \{ S_1, \dots, S_n \}$ the part parsing results in the pose-similar cluster, and similarly $\mathbb{K} = \{ K_1, \dots, K_n \}$ for the keypoints.

Morphed part parsing results:

$\widetilde{\mathbb{S}} = \{ T(S_i;\theta_i) \mid 1 \le i \le n, S_i \in \mathbb{S} \}$

$T(\cdot;\theta_i)$ is an affine transformation whose parameters $\theta_i$ are estimated from the source pose $K_i$ and the target pose $K_t$.

#### Averaging

$P_t = \frac{1}{n} \sum_{i=1}^n \widetilde{S}_i$
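The morphing and averaging steps can be sketched together in NumPy. The least-squares affine fit, the nearest-neighbor label warp, the one-hot encoding of label maps, and the $(x, y)$ keypoint layout are all illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def fit_affine(K_src, K_dst):
    """Least-squares affine theta_i mapping source keypoints K_i to
    the target pose K_t, used as T(.; theta_i)."""
    ones = np.ones((K_src.shape[0], 1))
    X = np.hstack([K_src, ones])                   # (v, 3)
    A, *_ = np.linalg.lstsq(X, K_dst, rcond=None)  # (3, 2)
    return A

def warp_labels(S, A, shape):
    """Nearest-neighbor warp of a label map S under affine A, using
    inverse mapping: each target pixel samples a source pixel."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    A_full = np.vstack([A.T, [0, 0, 1]])  # promote to 3x3 homogeneous form
    inv = np.linalg.inv(A_full)           # target -> source coordinates
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = inv @ pts
    sx = np.clip(np.rint(src[0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(src[1]).astype(int), 0, h - 1)
    return S[sy, sx].reshape(h, w)

def part_label_prior(segs, keypoints, K_t, num_parts):
    """P_t: average of one-hot encoded, pose-aligned segmentations."""
    h, w = segs[0].shape
    acc = np.zeros((h, w, num_parts))
    for S_i, K_i in zip(segs, keypoints):
        warped = warp_labels(S_i, fit_affine(K_i, K_t), (h, w))
        acc += np.eye(num_parts)[warped]  # accumulate one-hot maps
    return acc / len(segs)
```

Per pixel, the resulting prior is a probability distribution over the $u$ part labels, which is what the refinement network takes as input alongside $I_t$.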

#### Refinement

$\widehat{S}_t = f_r(I_t, P_t; \Psi)$

#### Refinement Network

Cost function, computed on labeled samples $(I_m, S_m, K_m)$ from $\mathcal{D}_s$ so that the ground truth $S_m$ supervises the refined output:

$\mathcal{E}(\Psi) = \sum_j \lVert S_m(j) - f_r^j(I_m, P_m; \Psi) \rVert_1$

#### Semi-Supervised Training for Parsing Network

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. 2016.

[5] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.

For our parsing network, we use the **VGG-16 based model proposed in [5]** for its strong performance and simple structure. In this network, **multi-scale inputs** are fed to a shared VGG-16 based DeepLab model [4], and a soft attention mechanism weights the FCN outputs across scales.
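The attention-weighted fusion over scales can be sketched as follows; the tensor shapes and the assumption that all scales are already resized to a common resolution are illustrative, not details taken from [5]:

```python
import numpy as np

def attention_fuse(score_maps, attn_logits):
    """Soft-attention fusion of per-scale FCN outputs: a softmax over
    the scale axis produces per-pixel weights for each scale.

    score_maps: (s, h, w, u) outputs of the shared network at s scales
    (assumed resized to a common resolution); attn_logits: (s, h, w)
    attention scores predicted per scale and pixel.
    """
    w = np.exp(attn_logits - attn_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # weights sum to 1 per pixel
    return (w[..., None] * score_maps).sum(axis=0)  # fused (h, w, u) scores
```

With equal attention logits this reduces to a plain average of the per-scale predictions; the learned attention lets each pixel favor the scale that resolves it best.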