Paper Reading: Adversarial Data Augmentation in Human Pose Estimation

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Title: Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation

This work uses an adversarial network to improve the data augmentation process during training. It explains how the data augmentation network works and how it is jointly trained with a pose estimator.


When data augmentation and the network are trained jointly, the data augmentation tries to generate “hard” examples that explore the weaknesses of the target network.


  1. Data is costly

  2. Natural images follow a long-tail distribution [1,2]

[1] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014.

[2] Z. Tang, Y. Zhang, Z. Li, and H. Lu. Face clustering in videos with proportion prior. In IJCAI, 2015.

Problems with random augmentation:

  1. Does not consider individual differences between training samples

  2. Does not match the training status (the distribution is static)

  3. Gaussian distributions do not address the long-tail issue


  • Adversarial learning

  • Augmentation network as generator

    • create “hard” augs to make the pose estimator fail
  • Pose network as discriminator

  • Generate adversarial data augs online

    • conditioned on

      • input images

      • training status

  • Transformations:

    • A straightforward design is problematic for convergence

      • generating adversarial pixels, deformations
    • Instead, sample scaling, rotating, occluding and so on to create data points

  • reward and penalty policy (addressing the issue of missing supervisions)

  • Augmentation network takes hierarchical features of pose network as input, instead of a raw image

Adversarial Data Aug

Augmentation network $G(\cdot \mid \theta_G)$

  • generate hard augs

Pose network $D(\cdot \mid \theta_D)$

  • learn from adversarial augmentations

  • evaluate quality of generations

Generating path

$G$ outputs adversarial aug result $\tau_a(\cdot)$

Random aug result $\tau_r(\cdot)$


$\tau_a \sim G(\mathbf{x}, \theta_D)$ means the generation of $G$ is conditioned on both the image $\mathbf{x}$ and the current status of the target network $D$

$\mathcal{L}(\cdot, \cdot)$ is a predefined loss function and $\mathbf{y}$ is the annotation
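Putting these definitions together, the objective of the generating path can be written roughly as below. This is a reconstruction from the notation in these notes, not a quotation of the paper's equation; the symbol $\Omega$ for the training set is an assumption.

```latex
\max_{\theta_G} \;
\mathbb{E}_{\mathbf{x} \in \Omega} \;
\mathbb{E}_{\tau_a \sim G(\mathbf{x}, \theta_D)}
\Big[
  \mathcal{L}\big(D(\tau_a(\mathbf{x})), \mathbf{y}\big)
  - \mathcal{L}\big(D(\tau_r(\mathbf{x})), \mathbf{y}\big)
\Big]
```

In words: $G$ is pushed to produce augmentations on which $D$ incurs a larger loss than it does on a random augmentation of the same image.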

Discrimination path

play two roles:

  1. $D$ evaluates generation quality as indicated in Eq. 1

  2. Learn from adversarial generations


Joint training

Aug op not differentiable

Propose a reward and penalty policy to create online GT of $G$

Adversarial Human Pose Estimation

Adversarial Scaling and Rotating (ASR)

divide the augmentation ranges (scaling and rotating parameters?) into $m$ and $n$ bins, each corresponding to a small bounded Gaussian

predict distributions over bins

sampling from distributions
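The three steps above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the helper name `sample_from_bins`, the bin counts, the value ranges, and the choice of a clipped normal with `sigma = width / 6` as the "bounded Gaussian" are all made up for the example and are not taken from the paper.

```python
import numpy as np

def sample_from_bins(probs, low, high, rng):
    """Pick a bin according to probs, then draw a value from a
    Gaussian centred in that bin and clipped to the bin's bounds."""
    probs = np.asarray(probs, dtype=float)
    edges = np.linspace(low, high, len(probs) + 1)   # equal-width bins (assumed)
    k = int(rng.choice(len(probs), p=probs / probs.sum()))
    left, right = edges[k], edges[k + 1]
    center = 0.5 * (left + right)
    sigma = (right - left) / 6.0                     # assumed spread
    value = float(np.clip(rng.normal(center, sigma), left, right))
    return k, value

rng = np.random.default_rng(0)
scale_probs = np.full(4, 1 / 4)   # stand-in for the predicted scaling distribution (m = 4)
rot_probs = np.full(6, 1 / 6)     # stand-in for the predicted rotation distribution (n = 6)

s_bin, scale = sample_from_bins(scale_probs, 0.75, 1.25, rng)
r_bin, angle = sample_from_bins(rot_probs, -30.0, 30.0, rng)
```

In a real setup `scale_probs` and `rot_probs` would come from the augmentation network rather than being uniform.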

ASR pre-training

  1. For every training image, sample a total of $m \times n$ augs, each drawn from a pair of Gaussians

  2. Fed into pose network to calc the loss (representing difficulty of this aug)

  3. Accumulate these losses into the corresponding bins (resulting in $m \times n$ sums)

  4. Normalize the sums to generate two probability vectors, $P^s \in \mathbb{R}^m$ and $P^r \in \mathbb{R}^n$, which approximate the GT scaling and rotating distributions

  5. Given the approx GT distributions $P^s \in \mathbb{R}^m$ and $P^r \in \mathbb{R}^n$, use a KL-divergence loss to pre-train the aug net

($\tilde{P}^{(\cdot)}_i$ is the distribution predicted by the aug net)
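A small numeric sketch of steps 3 to 5 may help. It assumes the two vectors are obtained by summing the $m \times n$ loss grid along each axis and normalizing, which is my reading of the notes rather than something stated explicitly; `asr_targets` and `kl_divergence` are hypothetical helper names.

```python
import numpy as np

def asr_targets(loss_grid):
    """loss_grid[i, j] is the accumulated pose loss for augs drawn from
    scaling bin i and rotation bin j. Returns (P_s, P_r)."""
    loss_grid = np.asarray(loss_grid, dtype=float)
    p_s = loss_grid.sum(axis=1)   # marginal over rotation bins (assumed)
    p_r = loss_grid.sum(axis=0)   # marginal over scaling bins (assumed)
    return p_s / p_s.sum(), p_r / p_r.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# m = 2 scaling bins, n = 3 rotation bins; larger loss means harder aug.
loss_grid = [[1.0, 1.0, 2.0],
             [2.0, 2.0, 4.0]]
p_s, p_r = asr_targets(loss_grid)

# Loss of an (untrained) aug net that predicts uniform distributions.
pretrain_loss = kl_divergence(p_s, [0.5, 0.5]) + kl_divergence(p_r, [1 / 3] * 3)
```

Bins that accumulate more loss receive more probability mass, so the pre-trained aug net starts out favouring harder scalings and rotations.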

Advantages of predicting distribs instead of direct aug:

  1. introduce uncertainties to avoid upside-down aug

  2. help to address missing GT issue during joint training

Adversarial Hierarchical Occluding (AHO)

  • Occlude part of the image (features), so the pose network is encouraged to learn strong references between visible and invisible joints.

  • more effective to occlude features than pixels

  • generate a mask ($4 \times 4$, then scaled up to the scales of the features) indicating the occluded part
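As a rough illustration of the last bullet, the snippet below upscales a $4 \times 4$ mask to a feature map's resolution and zeroes out the occluded region. The convention that 1 marks an occluded cell, the nearest-neighbour upscaling via `np.kron`, and the helper name `apply_occlusion` are assumptions made for the sketch.

```python
import numpy as np

def apply_occlusion(features, mask):
    """features: array of shape (C, H, W).
    mask: array of shape (4, 4), 1 = occluded cell (assumed convention).
    H and W are assumed to be multiples of the mask size."""
    _, h, w = features.shape
    mh, mw = mask.shape
    # Repeat each mask cell to cover an (h/mh) x (w/mw) block of the feature map.
    big = np.kron(mask, np.ones((h // mh, w // mw)))
    return features * (1.0 - big)[None, :, :]

features = np.ones((2, 8, 8))
mask = np.zeros((4, 4))
mask[0, 0] = 1                      # occlude the top-left cell
occluded = apply_occlusion(features, mask)
```

The same mask can be applied at several feature resolutions by calling the function once per feature map.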

AHO pre-training

aug net predicts an occluding distrib. instead of a mask

To create GT of occluding distrib.:

  1. vote a joint to one of $w \times h$ ($4 \times 4$) cells, indicating importance

  2. count and normalize to generate $P^O \in \mathbb{R}^{w \times h}$
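The voting step can be sketched as follows, assuming each joint simply votes for the grid cell that contains its pixel coordinates. The helper name `occlusion_target` and the fallback to a uniform distribution when no joints are given are assumptions; the array is stored as (rows, columns), i.e. $h \times w$.

```python
import numpy as np

def occlusion_target(joints, img_w, img_h, grid_w=4, grid_h=4):
    """joints: iterable of (x, y) pixel coordinates.
    Returns a (grid_h, grid_w) array of cell probabilities."""
    counts = np.zeros((grid_h, grid_w))
    for x, y in joints:
        col = min(int(x / img_w * grid_w), grid_w - 1)
        row = min(int(y / img_h * grid_h), grid_h - 1)
        counts[row, col] += 1       # one vote per joint
    total = counts.sum()
    if total == 0:                  # no joints: fall back to uniform (assumed)
        return np.full_like(counts, 1.0 / counts.size)
    return counts / total

joints = [(10, 10), (20, 20), (250, 250), (130, 60)]
p_o = occlusion_target(joints, img_w=256, img_h=256)
```

Cells that contain more joints get a higher probability of being occluded.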


Joint Training of Two Networks

Reward and penalty

Update according to current status of target net

  • evaluated by comparing with a reference

  • if adversarial aug is harder

    • rewarding, by increasing probability of the bin or cell

  • Otherwise

    • penalizing, decreasing probability
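One simple way to turn this policy into an online target distribution is sketched below. The step size `alpha`, the even redistribution of probability mass over the remaining bins, and the function name `reward_penalty_update` are assumptions for illustration, not the paper's exact update rule.

```python
import numpy as np

def reward_penalty_update(pred, k, harder, alpha=0.1):
    """pred: distribution predicted by the aug net.
    k: index of the bin (or cell) that was sampled.
    harder: True if the adversarial aug gave a larger loss than the reference."""
    target = np.asarray(pred, dtype=float).copy()
    delta = alpha if harder else -alpha
    others = np.arange(len(target)) != k
    target[k] += delta                       # reward or penalize the sampled bin
    target[others] -= delta / others.sum()   # keep the total mass unchanged
    target = np.clip(target, 0.0, None)
    return target / target.sum()

pred = np.full(4, 0.25)
rewarded = reward_penalty_update(pred, k=1, harder=True)
penalized = reward_penalty_update(pred, k=1, harder=False)
```

The resulting vector can then serve as the GT for the same KL-divergence loss used in pre-training.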

Equally split every mini batch into three shares:

  • random

  • ASR

  • AHO
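A minimal sketch of the split, assuming the samples in a mini batch are assigned to the three shares at random; `split_batch` is a hypothetical helper.

```python
import numpy as np

def split_batch(batch_size, rng):
    """Return index arrays for the random, ASR and AHO shares."""
    idx = rng.permutation(batch_size)
    random_idx, asr_idx, aho_idx = np.array_split(idx, 3)
    return {"random": random_idx, "asr": asr_idx, "aho": aho_idx}

rng = np.random.default_rng(0)
shares = split_batch(12, rng)
```

Each share would then go through its own augmentation (random, ASR, or AHO) before being fed to the pose network.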

Author: Texot
Reprint policy: All articles in this blog are licensed under CC BY 4.0 unless stated otherwise. If you reproduce an article, please credit the source: Texot.