Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Title: Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation
This work uses an adversarial network to improve the data augmentation process during training. It introduces how the data augmentation network works and how it is jointly trained with a pose estimator.
Abstract
By jointly training the data augmentation network and the target network, the augmentation network learns to generate “hard” examples that exploit the weaknesses of the target network
Introduction

Annotated data is costly

Natural images follow a long-tail distribution [1,2]
[1] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014.
[2] Z. Tang, Y. Zhang, Z. Li, and H. Lu. Face clustering in videos with proportion prior. In IJCAI, 2015.
Problems with random augmentation:

It does not consider individual differences between training samples

It does not match the training status of the network (the sampling distribution is static)

A Gaussian distribution does not address the long-tail issue
Proposed

Adversarial learning

Augmentation network as generator
 creates “hard” augs to make the pose estimator fail

Pose network as discriminator

Generate adversarial data augs online

conditioned on

input images

training status



Transformations:

A straightforward design is problematic in terms of convergence
 i.e., generating adversarial pixels or deformations

Instead, sample scaling, rotating, occluding, and so on to create new data points


A reward-and-penalty policy addresses the issue of missing supervision (the augmentation operations provide no direct GT for the augmentation network)

The augmentation network takes the hierarchical features of the pose network as input instead of a raw image, as in the sketch below
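A minimal PyTorch-style sketch of this design; the module sizes, names, and pooling strategy are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugNet(nn.Module):
    """Sketch: predict augmentation distributions from pose-net features."""
    def __init__(self, feat_channels, m_scale_bins=5, n_rot_bins=9):
        super().__init__()
        in_dim = sum(feat_channels)  # total channels of the hierarchical feature maps
        self.scale_head = nn.Linear(in_dim, m_scale_bins)
        self.rot_head = nn.Linear(in_dim, n_rot_bins)

    def forward(self, feats):
        # Global-average-pool each hierarchical feature map, then concatenate,
        # so the input is the pose network's features rather than a raw image.
        pooled = [F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats]
        h = torch.cat(pooled, dim=1)
        # Distributions over scaling and rotating bins (see ASR below).
        return F.softmax(self.scale_head(h), dim=1), F.softmax(self.rot_head(h), dim=1)
```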
Adversarial Data Aug
Augmentation network $G( \cdot \mid \theta_G )$
 generates hard augs
Pose network $D(\cdot \mid \theta_D)$

learns from adversarial augmentations

evaluates the quality of generations
Generating path
$G$ outputs adversarial aug result $\tau_a (\cdot)$
A random aug result $\tau_r (\cdot)$ is also produced; it serves later as the reference in the reward-and-penalty policy
Maximizing the pose network's loss on the adversarially augmented data:
$$\max_{\theta_G} \; \mathcal{L}\big( D(\tau_a(\mathbf{x}) \mid \theta_D),\ \tau_a(\mathbf{y}) \big), \quad \tau_a \sim G(\mathbf{x}, \theta_D) \tag{1}$$
$\tau_a \sim G(\mathbf{x}, \theta_D)$ means the generation of $G$ is conditioned both on the image $\mathbf{x}$ and the current status of the target network $D$
$\mathcal{L}(\cdot, \cdot)$ is a predefined loss function and $\mathbf{y}$ is the annotation
Discrimination path
$D$ plays two roles:

it evaluates generation quality, as indicated in Eq. 1

it learns from adversarial generations
Minimizing the same loss with respect to the pose network's parameters:
$$\min_{\theta_D} \; \mathcal{L}\big( D(\tau_a(\mathbf{x}) \mid \theta_D),\ \tau_a(\mathbf{y}) \big) \tag{2}$$
Joint training
The augmentation operations are not differentiable, so no gradient flows from $D$ back to $G$
A reward-and-penalty policy is proposed to create online GT for $G$ (detailed below)
Adversarial Human Pose Estimation
Adversarial Scaling and Rotating (ASR)
Divide the scaling range into $m$ bins and the rotating range into $n$ bins; each bin corresponds to a small bounded Gaussian
The aug net predicts distributions over the bins
Augmentation parameters are sampled from these distributions (see the sketch below)
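A minimal sketch of the two-stage sampling; the parameter ranges and the choice of sigma (a quarter of the bin width) are illustrative assumptions:

```python
import numpy as np

def sample_param(probs, lo, hi):
    """Pick a bin from the predicted distribution, then draw the parameter
    from a small bounded Gaussian centered on that bin (sigma assumed)."""
    n_bins = len(probs)
    bin_width = (hi - lo) / n_bins
    k = np.random.choice(n_bins, p=probs)           # sample a bin index
    center = lo + (k + 0.5) * bin_width
    val = np.random.normal(center, bin_width / 4)   # small Gaussian in the bin
    # Bound the sample so it stays inside its bin.
    return float(np.clip(val, center - bin_width / 2, center + bin_width / 2))

# Illustrative ranges (not the paper's): scale in [0.7, 1.3] over m = 5 bins,
# rotation in [-60, 60] degrees over n = 9 bins, uniform distributions.
scale = sample_param(np.full(5, 0.2), 0.7, 1.3)
angle = sample_param(np.full(9, 1.0 / 9.0), -60.0, 60.0)
```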
ASR pretraining

For every training image, sample $m \times n$ augs in total, each drawn from one pair of (scaling, rotating) Gaussians

Feed each aug into the pose network to calculate the loss (representing the difficulty of this aug)

Accumulate these losses into the corresponding bins (resulting in $m \times n$ sums)

Normalize the sums to generate two probability vectors, $P^s \in \mathbb{R}^m$ and $P^r \in \mathbb{R}^n$, which approximate the GT of the scaling and rotating distributions

Given the approximate GT distributions $P^s \in \mathbb{R}^m$ and $P^r \in \mathbb{R}^n$, use a KL-divergence loss, $\mathrm{KL}(P \,\|\, \tilde{P}) = \sum_i P_i \log (P_i / \tilde{P}_i)$, to pretrain the aug net
($\tilde{P}^{(\cdot)}_i$ is the distribution predicted by the aug net)
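A sketch of this target construction; `eval_loss` is a hypothetical callable standing in for "augment with a sample from the (i, j) pair of Gaussians, run the pose network, return the loss":

```python
import numpy as np

def asr_groundtruth(eval_loss, m=5, n=9):
    """Build approximate GT distributions for one training image by
    accumulating per-bin losses and normalizing (eval_loss is hypothetical)."""
    loss_grid = np.array([[eval_loss(i, j) for j in range(n)] for i in range(m)])
    p_s = loss_grid.sum(axis=1) / loss_grid.sum()   # scaling GT, P^s in R^m
    p_r = loss_grid.sum(axis=0) / loss_grid.sum()   # rotating GT, P^r in R^n
    return p_s, p_r

def kl_loss(p, p_tilde, eps=1e-8):
    """KL(P || P~) between the GT and the aug net's predicted distribution."""
    return float(np.sum(p * np.log((p + eps) / (p_tilde + eps))))
```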
Advantages of predicting distributions instead of direct augs:

introduces uncertainty, avoiding e.g. constantly generating upside-down augs

helps to address the missing-GT issue during joint training
Adversarial Hierarchical Occluding (AHO)

Occluding part of the image (features) encourages the network to learn strong references between visible and invisible joints.

It is more effective to occlude features than pixels

Generate a mask ($4 \times 4$, then scaled up to the resolutions of the feature maps) indicating the occluded part, as in the sketch below
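A sketch of turning the predicted occluding distribution into a feature mask; sampling the occluded cells without replacement, and the number of cells, are my assumptions:

```python
import torch
import torch.nn.functional as F

def aho_mask(p_occ, feat_h, feat_w, n_cells=3):
    """p_occ: (16,) predicted occluding distribution over the 4x4 grid.
    Zero out n_cells sampled cells, then upsample to the feature size."""
    idx = torch.multinomial(p_occ, n_cells, replacement=False)  # pick cells
    mask = torch.ones(16)
    mask[idx] = 0.0
    mask = mask.view(1, 1, 4, 4)
    return F.interpolate(mask, size=(feat_h, feat_w), mode='nearest')

# Occlude features rather than pixels:
# feats = feats * aho_mask(p_occ, feats.shape[2], feats.shape[3])
```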
AHO pretraining
The aug net predicts an occluding distribution instead of a mask
To create GT of occluding distrib.:

Vote each joint into one of the $w \times h$ ($4 \times 4$) grid cells, indicating the cell's importance

Count and normalize to generate $P^O \in \mathbb{R}^{w\times h}$ (a sketch follows)
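A sketch of the voting step, assuming joint coordinates are normalized to [0, 1]:

```python
import numpy as np

def aho_groundtruth(joints, w=4, h=4):
    """joints: (num_joints, 2) array of (x, y) in [0, 1].
    Vote each joint into a grid cell, count, and normalize to get P^O."""
    counts = np.zeros((h, w))
    for x, y in joints:
        col = min(int(x * w), w - 1)   # clamp x == 1.0 into the last column
        row = min(int(y * h), h - 1)
        counts[row, col] += 1
    return (counts / counts.sum()).reshape(-1)   # P^O in R^{w*h}
```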
Pretrain: presumably with the same KL-divergence loss as in ASR pretraining, using $P^O$ as the GT
Joint Training of Two Networks
Reward and penalty
Update the predicted distributions according to the current status of the target network

difficulty is evaluated by comparing the adversarial aug with a reference (the random aug $\tau_r$)

if the adversarial aug is harder (higher loss)

rewarding, by increasing the probability of the corresponding bin or cell


Otherwise

penalizing, by decreasing the probability (see the sketch after this list)
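A sketch of one reward-and-penalty step; the fixed step size `delta` and the renormalization are my assumptions about how the probabilities get adjusted:

```python
import numpy as np

def reward_penalty_update(p, bin_idx, loss_adv, loss_ref, delta=0.05):
    """Online GT for G: compare the adversarial aug's loss against the
    random-aug reference; raise the sampled bin/cell if it was harder,
    lower it otherwise, then renormalize into a distribution."""
    p = p.copy()
    if loss_adv > loss_ref:                       # harder than random: reward
        p[bin_idx] += delta
    else:                                         # easier than random: penalize
        p[bin_idx] = max(p[bin_idx] - delta, 0.0)
    return p / p.sum()
```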

Equally split every mini-batch into three shares (see the snippet after this list):

random

ASR

AHO
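A trivial sketch of the batch composition, assuming the batch size is divisible by three:

```python
def split_batch(batch):
    """Equally split a mini-batch into random / ASR / AHO shares."""
    k = len(batch) // 3
    return batch[:k], batch[k:2 * k], batch[2 * k:3 * k]
```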