Paper Reading: Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, Cewu Lu

Problem Definition

  • $\mathcal{D}_s = \{I_i, S_i, K_i\}^N_{i=1}$: the labeled dataset
  • $N$: number of labeled training examples
  • $I_i \in \mathbb{R}^{h \times w \times 3}$: input image
  • $S_i \in \mathbb{R}^{h \times w \times u}$: part segmentation label
  • $K_i \in \mathbb{R}^{v \times 2}$: keypoint annotation
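The definitions above can be mirrored directly as array shapes. A minimal sketch of one labeled triple, where the concrete sizes ($h$, $w$, $u$, $v$) are arbitrary placeholders:

```python
import numpy as np

# One labeled sample (I_i, S_i, K_i); the sizes below are hypothetical.
h, w, u, v = 64, 48, 7, 16                 # image size, part classes, keypoints
I = np.zeros((h, w, 3), dtype=np.float32)  # input image, (h, w, 3)
S = np.zeros((h, w, u), dtype=np.float32)  # part segmentation label, (h, w, u)
K = np.zeros((v, 2), dtype=np.float32)     # (x, y) keypoint coordinates, (v, 2)

dataset_s = [(I, S, K)]                    # D_s: a list of N such triples
```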

Standard Semantic Part Segmentation

$$\mathcal{E}(\Phi) = \sum_i \sum_j e\left[ f_p^j(I_i;\Phi),\ S_i(j) \right]$$

  • $f_p^j(I_i;\Phi)$ is the labeling at pixel $j$ produced by the parsing network with parameters $\Phi$
  • $e[\cdot]$ is the per-pixel loss function
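A minimal sketch of this cost: a per-pixel loss $e[\cdot]$ (cross-entropy here, as an assumption; the paper does not fix it in this note) summed over pixels $j$ and images $i$:

```python
import numpy as np

def per_pixel_ce(probs, labels):
    """probs: (h, w, u) softmax output of f_p; labels: (h, w) int part labels."""
    h, w, _ = probs.shape
    # Pick the predicted probability of the true class at every pixel j.
    p = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(np.clip(p, 1e-12, None))     # e[f_p^j(I_i), S_i(j)]

def cost(pred_list, label_list):
    """E(Phi): sum over images i and pixels j of the per-pixel loss."""
    return sum(per_pixel_ce(p, s).sum() for p, s in zip(pred_list, label_list))
```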

Another Dataset without Segmentation Labels

$\mathcal{D}_p = \{I_i, K_i\}^M_{i=1}$ is a dataset of $M$ examples with only keypoint annotations, where $M \gg N$



Semantic labeling of body parts at the pixel level is labor-intensive.

The small amount of labeled data may lead to overfitting and degraded performance in real-world scenarios.


Key idea: transfer the part segmentation annotations to unlabeled data based on pose similarity, generating extra training samples; this transfer is what makes the approach semi-supervised.

  • Semi-supervised method
  • Uses $\mathcal{D}_p$ for training by morphing segmentations of pose-similar samples from $\mathcal{D}_s$


  1. Given $I_t$ and $K_t$ with $(I_t, K_t) \in \mathcal{D}_p$, find a cluster of pose-similar samples in $\mathcal{D}_s$
  2. Generate a part-level prior $P_t$ from the segmentations of the clustered pose-similar samples
  3. Refine $P_t$ into $\hat{S}_t$ (this requires training a refinement network)
  4. Use $\hat{S}_t$ together with $S_i$ from $\mathcal{D}_s$ to train the parsing network (input: one image; output: its segmentation)


  • Compute Euclidean distances between $K_t$ and every keypoint annotation in $\mathcal{D}_s$
  • Keep the top-$k$ nearest samples as the pose-similar cluster
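The clustering step above can be sketched as a nearest-neighbor search; aggregating the per-keypoint distances with a single Frobenius norm per sample is an assumption of this sketch:

```python
import numpy as np

def top_k_pose_similar(K_t, keypoints_s, k):
    """K_t: (v, 2) target pose; keypoints_s: (N, v, 2) poses in D_s.

    Returns indices of the k samples whose keypoints are closest to K_t
    in Euclidean (Frobenius) distance.
    """
    d = np.linalg.norm(keypoints_s - K_t[None], axis=(1, 2))
    return np.argsort(d)[:k]
```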

Part-label prior


Denote by $\mathbb{S} = \{ S_1, \dots, S_n \}$ the part parsing results in the pose-similar cluster, and similarly $\mathbb{K} = \{ K_1, \dots, K_n \}$ for the keypoints.

Morphed part parsing results:

$$\widetilde{\mathbb{S}} = \{ T(S_i;\theta_i) \mid 1 \le i \le n,\ S_i \in \mathbb{S} \}$$

$T(\cdot)$ is an affine transformation whose parameters $\theta_i$ are estimated from $K_i$ and the target pose $K_t$


$$P_t = \frac{1}{n} \sum_{i=1}^n \widetilde{S}_i$$
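A sketch of the morph-and-average step. Fitting $\theta_i$ from the keypoint correspondence $K_i \to K_t$ by least squares is an assumption about how $T(\cdot)$ is estimated; the actual image-space warping of each mask is elided, and the averaging is shown on already-warped masks:

```python
import numpy as np

def fit_affine(K_i, K_t):
    """Solve K_t ~= theta @ [K_i; 1] for a 2x3 affine matrix theta_i."""
    v = K_i.shape[0]
    X = np.hstack([K_i, np.ones((v, 1))])         # (v, 3) homogeneous points
    theta, *_ = np.linalg.lstsq(X, K_t, rcond=None)
    return theta.T                                # (2, 3) affine parameters

def part_prior(warped_masks):
    """P_t = (1/n) * sum_i of the morphed masks S~_i."""
    return np.mean(np.stack(warped_masks), axis=0)
```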


$$\hat{S}_t = f_r(I_t, P_t; \Psi)$$

Refinement Network

The refinement network $f_r$, with parameters $\Psi$, is trained on labeled samples $(I_m, S_m) \in \mathcal{D}_s$ (with priors $P_m$ built by the same cluster-and-morph procedure) using the cost function

$$\mathcal{E}(\Psi) = \sum_j \left\lVert S_m(j) - f_r^j(I_m, P_m; \Psi) \right\rVert_1$$
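This cost is a per-pixel L1 distance between the ground-truth mask and the refined output; a minimal sketch (the network $f_r$ itself is not shown, only the loss it is trained with):

```python
import numpy as np

def refine_cost(S_m, S_refined):
    """E(Psi) for one sample: sum over pixels j of ||S_m(j) - f_r^j(...)||_1.

    S_m, S_refined: (h, w, u) ground-truth and refined part maps.
    """
    return np.abs(S_m - S_refined).sum()
```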

Semi-Supervised Training for Parsing Network

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.

[5] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.

From the paper: "For our parsing network, we use the VGG-16 based model proposed in [5] due to its effective performance and simple structure. In this network, multi-scale inputs are applied to a shared VGG-16 based DeepLab model [4] for predictions. A soft attention mechanism is employed to weight the outputs of the FCN over scales."
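The soft attention over scales can be sketched as a per-pixel softmax that weights each scale's (already resized-to-common-resolution) score map; the shapes and names here are assumptions of this sketch, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scales(score_maps, attn_logits):
    """score_maps: (s, h, w, u) per-scale FCN outputs; attn_logits: (s, h, w).

    Returns the (h, w, u) attention-weighted fusion across the s scales.
    """
    w = softmax(attn_logits, axis=0)[..., None]   # (s, h, w, 1) weights
    return (w * score_maps).sum(axis=0)
```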


Author: Texot