Paper Reading: Self Adversarial Training for Human Pose Estimation


Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Network

  • Generator
    • Fully convolutional network with residual blocks and a conv-deconv architecture (Hourglass)
  • Discriminator
    • Same as Generator, except
      • input is the RGB image together with the heatmaps (see the input sketch after this list)
      • output heatmaps are used to distinguish real from fake
    • Distinguishes real from fake by how well the input heatmaps are reconstructed
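A rough sketch of the discriminator's input interface (shapes and variable names are my own assumptions, not from the paper): the RGB image is stacked with the M joint heatmaps along the channel axis, and the discriminator outputs heatmaps of the same resolution.

```python
# Illustrative sketch only, not the paper's code: forming the discriminator input
# by concatenating the RGB image with the M joint heatmaps along channels.
# All shapes and names below are assumptions.
import torch

B, H, W, M = 2, 64, 64, 16               # batch size, heatmap resolution, number of joints (assumed)
image = torch.rand(B, 3, H, W)           # RGB image resized to the heatmap resolution
heatmaps = torch.rand(B, M, H, W)        # ground-truth or generated joint heatmaps

disc_input = torch.cat([image, heatmaps], dim=1)   # shape (B, 3 + M, H, W)
print(disc_input.shape)                            # torch.Size([2, 19, 64, 64])
```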

Generator Loss

Loss = Adversarial Loss + Error Loss (Generated − GT)

$$
\begin{aligned}
\mathcal{L}_{\mathrm{MSE}} &= \sum^N_{i=1} \sum^M_{j=1} \left( C_{ij} - \hat{C}_{ij} \right)^2 \\
\mathcal{L}_{\mathrm{adv}} &= \sum^M_{j=1} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\
\mathcal{L}_{G} &= \mathcal{L}_{\mathrm{MSE}} + \lambda_G \mathcal{L}_{\mathrm{adv}}
\end{aligned}
$$
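A minimal PyTorch-style sketch of $\mathcal{L}_G$, assuming `C` are the ground-truth heatmaps, `C_hat` the generated ones, and `disc_out` the discriminator's reconstruction $D(\hat{C}, X)$; the names and the default `lambda_G` are placeholders, not taken from the paper.

```python
# Hedged sketch of the generator loss above; not the paper's implementation.
import torch

def generator_loss(C, C_hat, disc_out, lambda_G=0.01):
    """L_G = L_MSE + lambda_G * L_adv, summing over joints and pixels."""
    l_mse = ((C - C_hat) ** 2).sum()           # distance to the ground-truth heatmaps
    l_adv = ((C_hat - disc_out) ** 2).sum()    # discriminator's reconstruction error on the fakes
    return l_mse + lambda_G * l_adv            # lambda_G is a placeholder value
```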

Discriminator

The discriminator reconstructs a new set of heatmaps. The quality of the reconstruction is determined by how similar it is to the input heatmaps (the same notion as an autoencoder). The loss is the error between the input heatmaps and the reconstructed heatmaps.

Training

$$
\begin{aligned}
\mathcal{L}_{\mathrm{real}} &= \sum^M_{j=1} \left( C_j - D(C_j, X) \right)^2 \\
\mathcal{L}_{\mathrm{fake}} &= \sum^M_{j=1} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\
\mathcal{L}_D &= \mathcal{L}_{\mathrm{real}} - k_t \mathcal{L}_{\mathrm{fake}}
\end{aligned}
$$

Minimize $\mathcal{L}_D$:

  • Minimize the error between the ground-truth heatmaps and their reconstruction
  • Maximize the error between the generated heatmaps and their reconstruction
  • The per-pixel value indicates how plausible the confidence at that pixel is
  • It offers detailed ‘comments’ on the input heatmaps and suggests which parts of the heatmaps do not yield a real pose
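A corresponding sketch of the BEGAN-style discriminator objective above, where `D_real` and `D_fake` stand for the discriminator's reconstructions of the ground-truth and generated heatmaps; again, the names are my assumptions.

```python
# Sketch of L_D = L_real - k_t * L_fake; also returns the two terms so they
# can feed the k_t update. Inputs are assumed to be torch tensors.
def discriminator_loss(C, D_real, C_hat, D_fake, k_t):
    l_real = ((C - D_real) ** 2).sum()       # reconstruction error on real heatmaps
    l_fake = ((C_hat - D_fake) ** 2).sum()   # reconstruction error on generated heatmaps
    return l_real - k_t * l_fake, l_real, l_fake
```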

$$
k_{t+1} = k_t + \lambda_k \left( \gamma \mathcal{L}_{\mathrm{real}} - \mathcal{L}_{\mathrm{fake}} \right)
$$

When the generator gets better than the discriminator, i.e., $\mathcal{L}_{\mathrm{fake}}$ is smaller than $\gamma \mathcal{L}_{\mathrm{real}}$, the generated heatmaps are real enough to fool the discriminator. Hence, $k_t$ increases, making the term $\mathcal{L}_{\mathrm{fake}}$ more dominant, so the discriminator is trained harder to recognize the generated heatmaps.
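The update itself is a one-liner; a sketch below, where `lambda_k` and `gamma` are hyperparameters whose values here are placeholders, not taken from the paper.

```python
# Sketch of the k_t update rule above. Clipping k_t to [0, 1] follows common
# BEGAN practice and is an assumption on my part.
def update_k(k_t, l_real, l_fake, lambda_k=0.001, gamma=0.5):
    k_next = k_t + lambda_k * (gamma * float(l_real) - float(l_fake))
    return min(max(k_next, 0.0), 1.0)
```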

Adversarial Training

Note

  • Few original points
  • Repeats descriptions of contributions made by others
  • Overly detailed descriptions of trivial points and of the same things as in other experiments

Author: Texot