Paper Reading: Self Adversarial Training for Human Pose Estimation

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.


  • Generator
    • Fully convolutional network with residual blocks and a conv-deconv architecture (Hourglass)
  • Discriminator
    • Same as Generator, except
      • takes the RGB image and the heatmaps as input
      • outputs heatmaps that are used to distinguish real from fake
    • Distinguishes real from fake by how well it reconstructs the input heatmaps
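The discriminator's input can be sketched as a channel-wise concatenation of the image and the heatmaps. The shapes and the joint count `M` below are assumptions for illustration, not values from the paper:

```python
import numpy as np

# Hypothetical shapes (not from the paper): the discriminator input is the
# RGB image X concatenated with the M joint heatmaps along the channel axis.
M, H, W = 16, 256, 256
X        = np.zeros((3, H, W))   # RGB image, 3 channels
heatmaps = np.zeros((M, H, W))   # real or generated heatmaps, one per joint
d_input  = np.concatenate([X, heatmaps], axis=0)  # shape (3 + M, H, W)
```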

Generator Loss

Loss = Adversarial Loss + MSE Loss ( Generated - GT )

$$
\begin{aligned}
\mathcal{L}_{\mathrm{MSE}} &= \sum_{i=1}^{N} \sum_{j=1}^{M} \left( C_{ij} - \hat{C}_{ij} \right)^2 \\
\mathcal{L}_{\mathrm{adv}} &= \sum_{j=1}^{M} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\
\mathcal{L}_{G} &= \mathcal{L}_{\mathrm{MSE}} + \lambda_G \mathcal{L}_{\mathrm{adv}}
\end{aligned}
$$
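A minimal NumPy sketch of the generator loss above. The shapes, the random stand-in for the discriminator output, and the value of lambda_G are assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: M joint heatmaps of size H x W for a single sample (N = 1).
M, H, W = 16, 64, 64
C          = rng.random((M, H, W))  # ground-truth heatmaps C_j
C_hat      = rng.random((M, H, W))  # generated heatmaps
D_of_C_hat = rng.random((M, H, W))  # stand-in for the discriminator output D(C_hat, X)

lambda_g = 0.01  # assumed weight; the paper may use a different value

l_mse = np.sum((C - C_hat) ** 2)           # L_MSE (single-sample case)
l_adv = np.sum((C_hat - D_of_C_hat) ** 2)  # L_adv
l_g   = l_mse + lambda_g * l_adv           # L_G
```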


Discriminator Loss

The discriminator reconstructs a new set of heatmaps. The quality of the reconstruction is measured by how similar it is to the input heatmaps (the same notion as an autoencoder). The loss is the error between the input heatmaps and the reconstructed heatmaps.


$$
\begin{aligned}
\mathcal{L}_{\mathrm{real}} &= \sum_{j=1}^{M} \left( C_j - D(C_j, X) \right)^2 \\
\mathcal{L}_{\mathrm{fake}} &= \sum_{j=1}^{M} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\
\mathcal{L}_D &= \mathcal{L}_{\mathrm{real}} - k_t \mathcal{L}_{\mathrm{fake}}
\end{aligned}
$$
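The discriminator loss can be sketched the same way. Here k_t starts at 0, and all shapes and the random stand-ins for the reconstructions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, H, W = 16, 64, 64
C      = rng.random((M, H, W))  # real (ground-truth) heatmaps
C_hat  = rng.random((M, H, W))  # generated heatmaps
D_real = rng.random((M, H, W))  # stand-in for D(C, X):     reconstruction of real heatmaps
D_fake = rng.random((M, H, W))  # stand-in for D(C_hat, X): reconstruction of generated heatmaps

k_t = 0.0  # balancing weight, updated every training step (BEGAN-style)

l_real = np.sum((C - D_real) ** 2)
l_fake = np.sum((C_hat - D_fake) ** 2)
l_d    = l_real - k_t * l_fake  # with k_t = 0, L_D reduces to L_real
```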

Minimize $\mathcal{L}_D$

  • Minimize the error between the GT heatmaps and their reconstruction
  • Maximize the error between the generated heatmaps and their reconstruction
  • The per-pixel reconstruction error indicates how plausible the confidence at that pixel is
  • It offers detailed ‘comments’ on the input heatmaps and suggests which parts of the heatmaps do not yield a real pose

$$
k_{t+1} = k_t + \lambda_k \left( \gamma \mathcal{L}_{\mathrm{real}} - \mathcal{L}_{\mathrm{fake}} \right)
$$

When the generator gets better than the discriminator, i.e., L_fake falls below γ L_real, the generated heatmaps are realistic enough to fool the discriminator. Hence k_t increases, making the L_fake term more dominant, so the discriminator is trained harder on recognizing the generated heatmaps.
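A numeric sketch of this update with toy values; lambda_k, gamma, and the loss values below are assumptions, not the paper's hyperparameters:

```python
# BEGAN-style update of the balancing weight k_t.
lambda_k, gamma = 0.001, 0.5   # assumed hyperparameters
k_t = 0.0
l_real, l_fake = 4.0, 1.0      # toy losses with l_fake < gamma * l_real: generator is "winning"

k_next = k_t + lambda_k * (gamma * l_real - l_fake)
# gamma * l_real - l_fake = 2.0 - 1.0 = 1.0 > 0, so k grows and the
# discriminator focuses more on recognizing the generated heatmaps
```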

Adversarial Training


  • Few original points
  • Replicates the description of contributions made by others
  • Overly detailed description of trivial points that are the same as in other experiments

Author: Texot
Reprint policy: Unless otherwise stated, all articles in this blog follow the CC BY 4.0 reprint policy. If reproduced, please indicate the source: Texot!