Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Ke Sun, Cuiling Lan, Junliang Xing, Wenjun Zeng, Dong Liu, Jingdong Wang
Focus on spatial configuration refinement by reducing variations
Motivated by observation: scattered distribution of the relative locations of joints
- Two-stage normalization scheme
- human body norm
- limb norm
- To make distribution of the relative joint locations compact
- Multi-scale supervision and multi-scale fusion are beneficial
Two key problems
- joint detection
- spatial refinement: this work
- human body normalization: rotating the human body to upright according to joint detection results, followed by global spatial refinement
- limb normalization: rotating the joints of each limb to make the relative positions more compact; four limb normalization modules in total, each followed by a spatial limb refinement module
- effective normalization schemes to facilitate the learning of conv. spatial models; can be applied following different joint detectors
- show the improvement by using multi-scale supervision and fusion
Joint detection model
- regress joint positions, e.g. DeepPose
- scoremap of joints
- classification problem, e.g. Convolutional pose machines, Hourglass
- regression problem, e.g.:
 J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.
 J. J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.
Problem with FCN-based methods: the joint positions are estimated from low-resolution score maps.
Joint Relation Model
Pictorial structures define deformable configurations by spring-like connections between pairs of parts to model complex joint relations
Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.
L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, 2013.
To model the human poses with large variations, a mixture model is usually learned for each joint.
W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In CVPR, 2014.
X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.
W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR, 2016.
Markov Random Field (MRF)
Tompson et al. formulate the spatial relations as a Markov Random Field (MRF)-like model over the distribution of spatial locations for each body part.
J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
Structured Feature Learning
adapts geometrical transform kernels to capture the spatial relationships of joints from feature maps.
X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In CVPR, 2016.
- Joint detection
- Spatial configuration refinement
Detector output: K+1 score maps, K for the joints and one for non-joint (background)
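To make the K+1 score-map layout concrete, here is a minimal sketch of reading joint locations off the detector output. It assumes the background map is the last channel and takes a plain argmax per joint; both choices, along with the name `joints_from_score_maps`, are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def joints_from_score_maps(score_maps):
    """Return a (K, 2) array of (x, y) peaks from a (K+1, H, W) score-map stack.

    Assumption: the last channel is the non-joint (background) map.
    """
    joint_maps = score_maps[:-1]              # drop the background channel
    k, h, w = joint_maps.shape
    flat_idx = joint_maps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)
```

Because the peaks are taken on the score-map grid, their precision is bounded by the score-map resolution.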
Spatial Configuration Refinement
- Global normalization
- global normalization module
- refinement module (refine all K joints)
- Semi-global refinement and local refinement
- Semi-global refinement
- Local refinement
- four branches, each
- corresponds to a limb
- local limb normalization module + local refinement module
Body normalization: rotate the body so that the line from the body center to the neck is upright
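A rough sketch of this body normalization, applied to joint coordinates rather than to score maps for brevity. It assumes image coordinates with y pointing down, so "upright" means the neck ends up directly above the center. The function name `normalize_body` and passing `center` and `neck` in explicitly are assumptions made for illustration.

```python
import numpy as np

def normalize_body(joints, center, neck):
    """Rotate (x, y) joints about `center` so that center->neck points up.

    Assumption: image coordinates, y grows downward, so "up" is (0, -1).
    """
    joints = np.asarray(joints, dtype=float)
    center = np.asarray(center, dtype=float)
    v = np.asarray(neck, dtype=float) - center
    # Angle needed to bring v onto the direction (0, -1), whose angle is -pi/2.
    theta = -np.pi / 2 - np.arctan2(v[1], v[0])
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return (joints - center) @ rot.T + center
```

The rotation is rigid, so distances between joints are unchanged; only the global orientation of the pose is removed.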
The end joints on the four limbs have higher variations
Limb normalization: make the four limbs upright
- root joint (shoulder, hip)
- middle joint (elbow, knee)
- end joint (wrist, ankle)
Limb normalization rotates the corresponding three score maps around the root joint so that the line connecting the root joint and the middle joint has a consistent orientation, e.g., vertically downwards
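The limb rotation can be sketched on score maps directly. This is a simplified version under stated assumptions: joint positions are taken as the argmax of each map, the warp uses nearest-neighbour sampling, and y grows downward so "vertically downwards" is the direction (0, +1). The helper names `peak_xy`, `rotate_maps_about` and `normalize_limb` are hypothetical, and the paper's actual warping may differ.

```python
import numpy as np

def peak_xy(score_map):
    """(x, y) location of the maximum of a single score map."""
    y, x = np.unravel_index(score_map.argmax(), score_map.shape)
    return np.array([x, y], dtype=float)

def rotate_maps_about(maps, pivot_xy, theta):
    """Rotate a (C, H, W) stack by `theta` about `pivot_xy` (nearest neighbour)."""
    _, h, w = maps.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - pivot_xy[0], ys - pivot_xy[1]
    c, s = np.cos(theta), np.sin(theta)
    # Inverse mapping: for every output pixel, look up its source pixel.
    src_x = np.rint(c * dx + s * dy + pivot_xy[0]).astype(int)
    src_y = np.rint(-s * dx + c * dy + pivot_xy[1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(maps)
    out[:, valid] = maps[:, src_y[valid], src_x[valid]]
    return out

def normalize_limb(limb_maps):
    """limb_maps: (3, H, W) stack ordered root, middle, end (assumed order)."""
    root, middle = peak_xy(limb_maps[0]), peak_xy(limb_maps[1])
    d = middle - root
    # Bring root->middle onto (0, +1), i.e. angle +pi/2 in image coordinates.
    theta = np.pi / 2 - np.arctan2(d[1], d[0])
    return rotate_maps_about(limb_maps, root, theta)
```

After this step the root stays in place and the middle joint sits directly below it, so only the position of the end joint relative to that fixed axis still varies.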