Paper Reading: Human Pose Estimation using Global and Local Normalization

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Ke Sun, Cuiling Lan, Junliang Xing, Wenjun Zeng, Dong Liu, Jingdong Wang


Focus on spatial configuration refinement by reducing variations

Motivated by observation: scattered distribution of the relative locations of joints

Proposed model

  • Two-stage normalization scheme
    • human body norm
    • limb norm
  • To make distribution of the relative joint locations compact
  • Multi-scale supervision and multi-scale fusion are beneficial


Two key problems

  • joint detection
  • spatial refinement: this work

Two normalizations

  • human body normalization: rotating the human body to upright according to joint detection results, followed by global spatial refinement
  • limb normalization: rotating the joints of each limb to make the relative positions more compact; there are four limb normalization modules in total, each followed by a spatial limb refinement module

Main contribution

  • effective normalization schemes to facilitate the learning of convolutional spatial models; they can be applied after different joint detectors
  • show the improvement by using multi-scale supervision and fusion

Joint detection model

  • regress joint positions, e.g. DeepPose
  • scoremap of joints

Estimation procedure

  • classification problem, e.g. Convolutional pose machines, Hourglass
  • regression problem, e.g. [1], [2]

[1] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.

[2] J. J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.

Problem with FCN-based methods: joint positions are estimated from low-resolution score maps.
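To make the resolution problem concrete, here is a minimal sketch. It assumes a hypothetical downsampling factor of 8 between the input image and the score map (the stride and the helper name `decode_from_low_res` are illustrative assumptions, not taken from the paper). A joint read off a score-map cell can only land on a coarse grid, so the error per axis can reach about half a stride.

```python
import numpy as np

STRIDE = 8  # assumed downsampling factor; not taken from the paper


def decode_from_low_res(true_xy, stride=STRIDE):
    """Quantize an image-space location to a score-map cell, then map it back."""
    cell = np.floor(np.asarray(true_xy, dtype=float) / stride)
    # Map the cell back to the center of the image region it covers.
    return cell * stride + stride / 2.0


true_xy = np.array([101.0, 63.0])
est_xy = decode_from_low_res(true_xy)
error = np.abs(est_xy - true_xy)  # per-axis localization error in pixels
print(est_xy, error)
```

The error is bounded by half the stride on each axis, which is small for coarse poses but matters for tight evaluation thresholds.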

Joint Relation Model

Pictorial structures define deformable configurations by spring-like connections between pairs of parts to model complex joint relations.

Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In CVPR, 2011.

L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In CVPR, 2013.

Pictorial structures extended to CNNs

To model the human poses with large variations, a mixture model is usually learned for each joint.

W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In CVPR, 2014.

X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS, 2014.

W. Yang, W. Ouyang, H. Li, and X. Wang. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In CVPR, 2016.

Markov Random Field (MRF)

Tompson et al. formulate the spatial relations as a Markov Random Field (MRF)-like model over the distribution of spatial locations for each body part.

J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.

Structured Feature Learning

Chu et al. adapt geometrical transform kernels to capture the spatial relationships of joints from feature maps.

X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. In CVPR, 2016.


face normalization


  • Joint detection
  • Spatial configuration refinement

detector: outputs K+1 score maps, K for the joints and one for non-joint (background)
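As a rough illustration of how joint locations could be read from such an output, the sketch below takes the argmax of each of the K joint score maps and ignores the background map. The channel ordering (background last) and the function name `decode_joints` are assumptions made for this example.

```python
import numpy as np


def decode_joints(score_maps):
    """Return (K, 2) joint locations in (x, y) order from (K + 1, H, W) score maps.

    Assumes the last channel is the background map (an assumption for this sketch).
    """
    joint_maps = score_maps[:-1]
    K, H, W = joint_maps.shape
    # Argmax over each flattened map, then convert back to 2D indices.
    flat_idx = joint_maps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=1)


K, H, W = 3, 4, 5
maps = np.zeros((K + 1, H, W))
maps[0, 1, 2] = 1.0  # joint 0 peaks at (x=2, y=1)
maps[1, 3, 0] = 1.0  # joint 1 peaks at (x=0, y=3)
maps[2, 0, 4] = 1.0  # joint 2 peaks at (x=4, y=0)
maps[3] = 0.5        # background channel, not decoded
joints = decode_joints(maps)
print(joints)
```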

Spatial Configuration Refinement

  1. Global normalization
    1. global normalization module
    2. refinement module (refine all K joints)
  2. Semi-global refinement and local refinement
    1. Semi-global refinement
    2. Local refinement
      • four branches, each
        • correspond to a limb
        • local limb normalization module + local refinement module

Body Normalization

Rotate the body so that the line from the body center to the neck is upright.
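A minimal sketch of this rotation, under simplifying assumptions: it operates on joint coordinates rather than on score maps, takes the body center as a given input, and treats "upright" as the negative y direction in image coordinates. The function names `rotate_about` and `body_normalize` are hypothetical.

```python
import numpy as np


def rotate_about(points, pivot, angle):
    """Rotate (N, 2) points around pivot by angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    return (points - pivot) @ R.T + pivot


def body_normalize(joints, center, neck, up=(0.0, -1.0)):
    """Rotate all joints about center so that center->neck aligns with `up`.

    `up` defaults to (0, -1) because y grows downward in image coordinates.
    """
    joints = np.asarray(joints, dtype=float)
    center = np.asarray(center, dtype=float)
    v = np.asarray(neck, dtype=float) - center
    current = np.arctan2(v[1], v[0])
    target = np.arctan2(up[1], up[0])
    return rotate_about(joints, center, target - current)


center = np.array([10.0, 10.0])
neck = np.array([13.0, 6.0])            # tilted body, center-to-neck length 5
joints = np.array([neck, [10.0, 15.0]])  # neck plus one other joint
normalized = body_normalize(joints, center, neck)
print(normalized[0])  # neck now sits directly above the center
```

Rotation preserves distances to the pivot, so only the global orientation changes while the pose itself stays the same.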

Limb Normalization

The end joints of the four limbs have higher variations

Make 4 limbs upright

Each limb:

  • root joint (shoulder, hip)
  • middle joint (elbow, knee)
  • end joint (wrist, ankle)

Limb normalization rotates the corresponding three score maps around the root joint so that the line connecting the root joint and the middle joint has a consistent orientation, e.g., vertically downwards.
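The effect on the relative joint positions can be sketched on coordinates instead of score maps (a simplification for illustration; `limb_normalize` is a hypothetical helper). Two arms with the same elbow bend but different global orientations end up with identical normalized configurations, which is the sense in which the distribution becomes more compact.

```python
import numpy as np


def limb_normalize(root, middle, end, target=(0.0, 1.0)):
    """Rotate a limb about its root so that root->middle aligns with `target`.

    `target` defaults to (0, 1), i.e. vertically downward in image coordinates.
    Returns a (3, 2) array holding the root, middle, and end joints.
    """
    root, middle, end = (np.asarray(p, dtype=float) for p in (root, middle, end))
    v = middle - root
    angle = np.arctan2(target[1], target[0]) - np.arctan2(v[1], v[0])
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    pts = np.stack([root, middle, end])
    return (pts - root) @ R.T + root


# The same arm pose in two global orientations (arm_b is arm_a rotated by 90 degrees).
arm_a = [(0, 0), (0, 2), (1, 3)]
arm_b = [(0, 0), (2, 0), (3, -1)]
norm_a = limb_normalize(*arm_a)
norm_b = limb_normalize(*arm_b)
print(norm_a[2] - norm_a[0], norm_b[2] - norm_b[0])  # same end-joint offset
```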

Multi-scale Supervision and Fusion for Joint Detection

Author: Texot
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this post, please credit the source: Texot.