Paper Reading: Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, Cewu Lu

Problem Definition

  • $\mathcal{D}_s = \{I_i, S_i, K_i\}^N_{i=1}$: the labeled dataset
  • $N$: number of labeled training examples
  • $I_i \in \mathbb{R}^{h \times w \times 3}$: input image
  • $S_i \in \mathbb{R}^{h \times w \times u}$: part segmentation label
  • $K_i \in \mathbb{R}^{v \times 2}$: keypoint annotation
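The definitions above can be mirrored directly as array shapes. A minimal sketch of one labeled triple, where the concrete sizes ($h$, $w$, $u$, $v$) are arbitrary placeholders:

```python
import numpy as np

# One labeled sample (I_i, S_i, K_i); the sizes below are hypothetical.
h, w, u, v = 64, 48, 7, 16                 # image size, part classes, keypoints
I = np.zeros((h, w, 3), dtype=np.float32)  # input image, (h, w, 3)
S = np.zeros((h, w, u), dtype=np.float32)  # part segmentation label, (h, w, u)
K = np.zeros((v, 2), dtype=np.float32)     # (x, y) keypoint coordinates, (v, 2)

dataset_s = [(I, S, K)]                    # D_s: a list of N such triples
```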

Standard Semantic Part Segmentation

$$\mathcal{E}(\Phi) = \sum_i \sum_j e\left[ f_p^j(I_i;\Phi),\ S_i(j) \right]$$

  • $f_p^j(I_i;\Phi)$ is the labeling at pixel $j$ produced by the parsing network with parameters $\Phi$
  • $e[\cdot]$ is the per-pixel loss function
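A minimal sketch of this cost: a per-pixel loss $e[\cdot]$ (cross-entropy here, as an assumption; the paper does not fix it in this note) summed over pixels $j$ and images $i$:

```python
import numpy as np

def per_pixel_ce(probs, labels):
    """probs: (h, w, u) softmax output of f_p; labels: (h, w) int part labels."""
    h, w, _ = probs.shape
    # Pick the predicted probability of the true class at every pixel j.
    p = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -np.log(np.clip(p, 1e-12, None))     # e[f_p^j(I_i), S_i(j)]

def cost(pred_list, label_list):
    """E(Phi): sum over images i and pixels j of the per-pixel loss."""
    return sum(per_pixel_ce(p, s).sum() for p, s in zip(pred_list, label_list))
```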

Another Dataset without Segmentation Labels

$\mathcal{D}_p = \{I_i, K_i\}^M_{i=1}$ is a dataset of $M$ examples with only keypoint annotations, where $M \gg N$



Semantic labeling of body parts at the pixel level is labor-intensive.

The small amount of labeled data may lead to overfitting and degraded performance in real-world scenarios.


Key idea: transfer the part segmentation annotations to unlabeled data based on pose similarity, generating extra training samples; this transfer is what makes the approach semi-supervised.

  • Semi-supervised method
  • Uses $\mathcal{D}_p$ for training by morphing segmentations of pose-similar samples from $\mathcal{D}_s$


  1. Given $I_t$ and $K_t$ with $(I_t, K_t) \in \mathcal{D}_p$, find a cluster of pose-similar samples in $\mathcal{D}_s$
  2. Generate a part-level prior $P_t$ from the segmentations of the clustered pose-similar samples
  3. Refine $P_t$ into $\hat{S}_t$ (this requires training a refinement network)
  4. Use $\hat{S}_t$ together with $S_i$ from $\mathcal{D}_s$ to train the parsing network (input: one image; output: its segmentation)


  • Compute Euclidean distances between $K_t$ and every keypoint annotation in $\mathcal{D}_s$
  • Keep the top-$k$ nearest samples as the pose-similar cluster
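The clustering step above can be sketched as a nearest-neighbor search; aggregating the per-keypoint distances with a single Frobenius norm per sample is an assumption of this sketch:

```python
import numpy as np

def top_k_pose_similar(K_t, keypoints_s, k):
    """K_t: (v, 2) target pose; keypoints_s: (N, v, 2) poses in D_s.

    Returns indices of the k samples whose keypoints are closest to K_t
    in Euclidean (Frobenius) distance.
    """
    d = np.linalg.norm(keypoints_s - K_t[None], axis=(1, 2))
    return np.argsort(d)[:k]
```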

Part-label prior


Denote by $\mathbb{S} = \{ S_1, \dots, S_n \}$ the part parsing results in the pose-similar cluster, and similarly $\mathbb{K} = \{ K_1, \dots, K_n \}$ for the keypoints.

Morphed part parsing results:

$$\widetilde{\mathbb{S}} = \{ T(S_i;\theta_i) \mid 1 \le i \le n,\ S_i \in \mathbb{S} \}$$

$T(\cdot)$ is an affine transformation whose parameters $\theta_i$ are estimated from $K_i$ and the target pose $K_t$


$$P_t = \frac{1}{n} \sum_{i=1}^n \widetilde{S}_i$$
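A sketch of the morph-and-average step. Fitting $\theta_i$ from the keypoint correspondence $K_i \to K_t$ by least squares is an assumption about how $T(\cdot)$ is estimated; the actual image-space warping of each mask is elided, and the averaging is shown on already-warped masks:

```python
import numpy as np

def fit_affine(K_i, K_t):
    """Solve K_t ~= theta @ [K_i; 1] for a 2x3 affine matrix theta_i."""
    v = K_i.shape[0]
    X = np.hstack([K_i, np.ones((v, 1))])         # (v, 3) homogeneous points
    theta, *_ = np.linalg.lstsq(X, K_t, rcond=None)
    return theta.T                                # (2, 3) affine parameters

def part_prior(warped_masks):
    """P_t = (1/n) * sum_i of the morphed masks S~_i."""
    return np.mean(np.stack(warped_masks), axis=0)
```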


$$\hat{S}_t = f_r(I_t, P_t; \Psi)$$

Refinement Network

The refinement network $f_r$, with parameters $\Psi$, is trained on labeled samples $(I_m, S_m) \in \mathcal{D}_s$ (with priors $P_m$ built by the same cluster-and-morph procedure) using the cost function

$$\mathcal{E}(\Psi) = \sum_j \left\lVert S_m(j) - f_r^j(I_m, P_m; \Psi) \right\rVert_1$$
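This cost is a per-pixel L1 distance between the ground-truth mask and the refined output; a minimal sketch (the network $f_r$ itself is not shown, only the loss it is trained with):

```python
import numpy as np

def refine_cost(S_m, S_refined):
    """E(Psi) for one sample: sum over pixels j of ||S_m(j) - f_r^j(...)||_1.

    S_m, S_refined: (h, w, u) ground-truth and refined part maps.
    """
    return np.abs(S_m - S_refined).sum()
```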

Semi-Supervised Training for Parsing Network

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.

[5] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.

From the paper: "For our parsing network, we use the VGG-16 based model proposed in [5] due to its effective performance and simple structure. In this network, multi-scale inputs are applied to a shared VGG-16 based DeepLab model [4] for predictions. A soft attention mechanism is employed to weight the outputs of the FCN over scales."
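The soft attention over scales can be sketched as a per-pixel softmax that weights each scale's (already resized-to-common-resolution) score map; the shapes and names here are assumptions of this sketch, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scales(score_maps, attn_logits):
    """score_maps: (s, h, w, u) per-scale FCN outputs; attn_logits: (s, h, w).

    Returns the (h, w, u) attention-weighted fusion across the s scales.
    """
    w = softmax(attn_logits, axis=0)[..., None]   # (s, h, w, 1) weights
    return (w * score_maps).sum(axis=0)
```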


Author: Texot