Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

## Network

- Generator
  - Fully convolutional network with residual blocks and a conv-deconv (hourglass) architecture
- Discriminator
  - Same architecture as the generator, except that it
    - takes the RGB image and heatmaps as input
    - outputs reconstructed heatmaps, and the reconstruction error is used to distinguish real heatmaps from fake ones

## Generator Loss

Loss = Adversarial Loss + Error Loss (Generated − GT)

$$\begin{aligned} \mathcal{L}_{\mathrm{MSE}} &= \sum^N_{i=1} \sum^M_{j=1} \left( C_{ij} - \hat{C}_{ij} \right)^2 \\ \mathcal{L}_{\mathrm{adv}} &= \sum^M_{j=1} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\ \mathcal{L}_{G} &= \mathcal{L}_{\mathrm{MSE}} + \lambda_G \mathcal{L}_{\mathrm{adv}} \end{aligned}$$
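The two generator terms can be sketched in a few lines of numpy. The discriminator below is a stub standing in for the real hourglass network, and the shapes and $\lambda_G$ value are illustrative assumptions, not values from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, H, W = 2, 16, 64, 64            # batch size, joints, heatmap size (assumed)
C_gt = rng.random((N, M, H, W))       # ground-truth heatmaps C
C_hat = rng.random((N, M, H, W))      # generated heatmaps C-hat

def discriminator(heatmaps, image=None):
    """Stub for D(., X): reconstructs its input heatmaps.
    The real model is an hourglass network conditioned on the RGB image X."""
    return heatmaps + 0.1 * rng.standard_normal(heatmaps.shape)

# L_MSE: squared error between generated and ground-truth heatmaps
L_mse = np.sum((C_gt - C_hat) ** 2)

# L_adv: squared error between generated heatmaps and their reconstruction by D
L_adv = np.sum((C_hat - discriminator(C_hat)) ** 2)

lambda_G = 0.01                       # weighting hyperparameter (assumed value)
L_G = L_mse + lambda_G * L_adv
```

Note that $\mathcal{L}_{\mathrm{adv}}$ does not use the ground truth at all: it only penalizes heatmaps the discriminator fails to reconstruct faithfully.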

## Discriminator

The discriminator reconstructs a new set of heatmaps. The quality of the reconstruction is measured by how similar it is to the input heatmaps (the same notion as an autoencoder). Its loss is the error between the input heatmaps and the reconstructed heatmaps.

### Training

$$\begin{aligned} \mathcal{L}_{\mathrm{real}} &= \sum^M_{j=1} \left( C_j - D(C_j, X) \right)^2 \\ \mathcal{L}_{\mathrm{fake}} &= \sum^M_{j=1} \left( \hat{C}_j - D(\hat{C}_j, X) \right)^2 \\ \mathcal{L}_D &= \mathcal{L}_{\mathrm{real}} - k_t \mathcal{L}_{\mathrm{fake}} \end{aligned}$$
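The same stub-discriminator sketch extends to the training objective. Again the shapes, noise level, and initial $k_t$ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

M, H, W = 16, 64, 64                  # joints and heatmap size (assumed)
C = rng.random((M, H, W))             # ground-truth heatmaps
C_hat = rng.random((M, H, W))         # generated heatmaps

def discriminator(heatmaps, image=None):
    """Stub for D(., X): autoencoder-style reconstruction of the input heatmaps."""
    return heatmaps + 0.05 * rng.standard_normal(heatmaps.shape)

# Reconstruction error on real (GT) and fake (generated) heatmaps
L_real = np.sum((C - discriminator(C)) ** 2)
L_fake = np.sum((C_hat - discriminator(C_hat)) ** 2)

k_t = 0.5                             # balancing variable (assumed initial value)
L_D = L_real - k_t * L_fake           # minimized w.r.t. the discriminator
```

Minimizing $\mathcal{L}_D$ pushes $\mathcal{L}_{\mathrm{real}}$ down and, through the negated term, pushes $\mathcal{L}_{\mathrm{fake}}$ up.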

### Minimize $\mathcal{L}_D$

- Minimize the error between the GT heatmaps and their reconstruction
- Maximize the error between the generated heatmaps and their reconstruction
- The pixel-wise reconstruction error indicates how good the confidence at that pixel is
- It offers detailed 'comments' on the input heatmaps and suggests which parts of the heatmaps do not yield a real pose

$$k_{t+1} = k_t + \lambda_k (\gamma \mathcal{L}_{\mathrm{real}} - \mathcal{L}_{\mathrm{fake}})$$

When the generator becomes better than the discriminator, i.e., $\mathcal{L}_{\mathrm{fake}}$ is smaller than $\gamma \mathcal{L}_{\mathrm{real}}$, the generated heatmaps are real enough to fool the discriminator. Hence, $k_t$ increases, making the term $\mathcal{L}_{\mathrm{fake}}$ more dominant, so the discriminator is trained more on recognizing the generated heatmaps.