Paper Reading: Cascaded Pyramid Network for Multi-Person Pose Estimation

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.


Main reasons for failing to localize “hard” keypoints

  1. “hard” joints cannot be recognized simply from their appearance features alone
  2. “hard” joints are not explicitly addressed during the training process

Two stages

  1. GlobalNet
    • learns a feature representation based on a feature pyramid network, providing sufficient context information
  2. RefineNet
    • built on the pyramid features from GlobalNet
    • explicitly addresses “hard” joints with an online hard keypoints mining loss

Top-down pipeline

  1. Human detector
    • generates bounding boxes for person instances
  2. Cascaded Pyramid Network
    • keypoint localization within each box


  • Cascaded Pyramid Network (CPN): GlobalNet + RefineNet
  • evaluates the effects of various factors in the top-down pipeline
  • state-of-the-art results on COCO




T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.


Human Detector

Feature pyramid networks (FPN)

T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

ROIAlign from Mask R-CNN is adopted to replace the ROIPooling in FPN.

Cascaded Pyramid Network (CPN)

Hourglass: stacking two hourglass modules instead of eight is sufficient

Motivated by Hourglass[1] and [2]

[1] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499, 2016.

[2] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy. Towards Accurate Multi-person Pose Estimation in the Wild. ArXiv e-prints, Jan. 2017.


GlobalNet is based on a ResNet backbone


Shallow features have higher spatial resolution but carry less semantic information. A U-shape structure is usually used to maintain both spatial resolution and semantic information.

FPN further improves the U-shape structure with deeply supervised information.



Slightly different from FPN, a 1×1 convolution is applied before each element-wise sum in the upsampling process.
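The upsample-and-sum step can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: channel widths, weight initialization, and the nearest-neighbour upsampling are assumptions for the sketch.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W) feature map; w: (C_out, C_in) pointwise weights.
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
c_deep    = rng.standard_normal((256, 8, 8))    # deeper, coarser pyramid level
c_shallow = rng.standard_normal((512, 16, 16))  # shallower, finer pyramid level

# Project both maps to a common channel width with 1x1 convs,
# then upsample the deeper map and sum element-wise.
w_deep    = rng.standard_normal((256, 256)) * 0.01
w_shallow = rng.standard_normal((256, 512)) * 0.01
fused = upsample2x(conv1x1(c_deep, w_deep)) + conv1x1(c_shallow, w_shallow)
print(fused.shape)  # (256, 16, 16)
```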


GlobalNet can effectively locate keypoints like eyes, but may fail to precisely locate hips. Such keypoints usually require more context information rather than nearby appearance features.


  • explicitly addresses hard keypoints
  • transmits information across different levels and finally integrates the information of all levels via upsampling and concatenation, as in HyperNet

T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Computer Vision and Pattern Recognition, pages 845–853, 2016.
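The HyperNet-style integration described above can be sketched as: upsample every pyramid level to the finest resolution, then concatenate along channels. A minimal numpy sketch, with shapes chosen only for illustration:

```python
import numpy as np

def integrate_pyramid(features):
    # features: list of (C, H, W) maps from different pyramid levels.
    # Upsample every map to the finest resolution by nearest-neighbour
    # repetition, then concatenate along the channel axis.
    target_h = max(f.shape[1] for f in features)
    ups = []
    for f in features:
        scale = target_h // f.shape[1]
        ups.append(f.repeat(scale, axis=1).repeat(scale, axis=2))
    return np.concatenate(ups, axis=0)

feats = [np.zeros((64, 16, 16)), np.zeros((64, 8, 8)), np.zeros((64, 4, 4))]
out = integrate_pyramid(feats)
print(out.shape)  # (192, 16, 16)
```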

  • differences from Hourglass
    • RefineNet concatenates all pyramid features instead of only using the upsampled features at the end
    • more bottleneck blocks are stacked in deeper layers
  • explicitly selects hard keypoints online based on the training loss, and only back-propagates the selected losses
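The online hard keypoints mining idea can be sketched as follows: compute a per-keypoint loss, keep only the top-k hardest (largest-loss) keypoints, and average those. The L2 heatmap loss and top_k=8 here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ohkm_loss(pred, target, top_k=8):
    # pred, target: (K, H, W) heatmaps for K keypoints.
    # Per-keypoint mean squared error -- one scalar loss per keypoint.
    per_kpt = ((pred - target) ** 2).mean(axis=(1, 2))   # shape (K,)
    # Keep only the top_k hardest keypoints; in training, only these
    # selected losses would be back-propagated.
    hardest = np.sort(per_kpt)[-top_k:]
    return hardest.mean()

rng = np.random.default_rng(0)
pred = rng.standard_normal((17, 64, 48))   # 17 COCO keypoints
target = np.zeros((17, 64, 48))
loss = ohkm_loss(pred, target)
```

Averaging only the hardest keypoints focuses the gradient on joints the network currently gets wrong, which is the stated motivation for the OHKM loss.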


Experimental Setup

Testing Details

A 2D Gaussian filter is applied to the predicted heatmaps.
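A minimal sketch of this smoothing step, using a separable Gaussian blur followed by an argmax to read out the keypoint location (the sigma value and decoding by plain argmax are assumptions for illustration):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    # Normalized 1D Gaussian kernel of length 2*radius + 1.
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth_heatmap(h, sigma=1.0):
    # Separable 2D Gaussian blur: convolve rows, then columns.
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    h = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, h)
    h = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, h)
    return h

# A single sharp response stays at the same argmax after smoothing,
# but noisy multi-modal heatmaps become easier to decode.
heat = np.zeros((64, 48))
heat[30, 20] = 1.0
y, x = np.unravel_index(np.argmax(smooth_heatmap(heat)), heat.shape)
print(y, x)  # 30 20
```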

Ablation Experiment

Person Detector

NMS strategies

Soft-NMS performs better than hard NMS.

N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Improving Object Detection With One Line of Code. ArXiv e-prints, Apr. 2017.
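The Gaussian soft-NMS variant from the cited paper can be sketched as follows: instead of discarding boxes that overlap a higher-scoring box, their scores are decayed by exp(-IoU²/σ). The σ and score threshold values here are illustrative defaults, not the paper's tuned settings.

```python
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4), both as [x1, y1, x2, y2].
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter)

def soft_nms(boxes, scores, sigma=0.5, thresh=0.001):
    # Gaussian soft-NMS: decay overlapping scores instead of dropping boxes.
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        i = max(idxs, key=lambda j: scores[j])   # highest remaining score
        keep.append(i)
        idxs.remove(i)
        if idxs:
            rest = np.array(idxs)
            scores[rest] *= np.exp(-(iou(boxes[i], boxes[rest]) ** 2) / sigma)
            idxs = [j for j in idxs if scores[j] > thresh]  # prune tiny scores
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
keep, new_scores = soft_nms(boxes, scores)
print(keep)  # [0, 2, 1] -- the overlapping box 1 is kept, but decayed
```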

Detection Performance

Keypoint detection AP gains diminish as the accuracy of the detection boxes increases.


A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.


Online Hard Keypoints Mining

Design Choices of RefineNet

  • a concatenate (concat) operation is directly attached, as in HyperNet
  • a bottleneck block is attached first in each layer (C_2 ~ C_5), followed by a concatenate operation
  • different numbers of bottleneck blocks are applied to different layers, followed by a concatenate operation, as shown in Figure 1

Author: Texot
Reprint policy: Unless otherwise stated, all articles in this blog follow the CC BY 4.0 reprint policy. If reproduced, please indicate the source Texot!