Paper Reading: Towards Accurate Multi-person Pose Estimation in the Wild

Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.


  • Stages:
    1. Predict the location and scale of boxes likely to contain people (Faster-RCNN detector)
    2. Estimate the keypoints of the person in each proposed box
  • Detecting keypoints: ResNet
  • To combine the heatmap and offset outputs, introduce an aggregation procedure that yields highly localized keypoint predictions
  • Use keypoint-based NMS instead of cruder box-level NMS
  • Use keypoint-based confidence score estimation instead of box-level scoring


  • Top-down approach
  • Stages:
    1. Faster-RCNN method on top of a ResNet-101 CNN, as in J. Huang
    2. ResNet: predict activation heatmaps and offsets for each keypoint, similar to L. Pishchulin and E. Insafutdinov, then combine the predictions using a novel form of heatmap-offset aggregation
  • Avoid duplicate pose detections by keypoint-based NMS
  • Propose keypoint-based confidence score estimator rather than using Faster-RCNN box scores

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv:1611.10012, 2016.

L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.

E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.

This part of the paper is comprehensive and well written.


A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.

V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In arxiv, 2016.

  • Infer part relationships
  • Infer pairwise joint locations

G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In ECCV, 2016.

  • Inspired by work in sequence-to-sequence
  • Predicted sequentially rather than independently
  • Parts conditioned on all other parts


Person Box Detection


Person Pose Estimation


  • Keypoint Disk Heatmap
  • Offset Field

Keypoint Disk Heatmap

$$h_k(x_i) = 1 \textit{ if } \left\Vert x_i - l_k \right\Vert \le R$$
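A minimal sketch of the disk heatmap target, assuming pixel positions on an integer grid and a value of 0 outside the disk. The function name `disk_heatmap` and its parameters are made up for illustration and are not from the paper.

```python
import math

def disk_heatmap(height, width, keypoint, radius):
    """Binary target: 1 where the pixel lies within `radius` of `keypoint`, else 0."""
    kx, ky = keypoint
    return [
        [1 if math.hypot(x - kx, y - ky) <= radius else 0 for x in range(width)]
        for y in range(height)
    ]

# Hypothetical 5x5 crop with one keypoint at (x=2, y=2) and R = 1.
target = disk_heatmap(5, 5, keypoint=(2, 2), radius=1)
for row in target:
    print(row)
```

With R = 1 only the keypoint pixel and its four direct neighbours are positive; a larger R gives the classifier more positive pixels per keypoint.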

Offset Field

$$f_k(x_i) = \sum_j \frac{1}{\pi R^2} G(x_j + F_k(x_j) - x_i) h_k(x_j)$$

Hough voting: each point j in the image crop grid casts a vote with its estimate of the position of every keypoint, and each vote is weighted by the probability that point j lies in the disk of influence of the corresponding keypoint.
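The voting step can be sketched in plain Python for a single keypoint type. This assumes G is a bilinear interpolation kernel (my reading of the paper; these notes do not define it), and the names `aggregate`, `heatmap` and `offsets` are hypothetical. Each pixel j splats its weighted vote onto the four grid points around its predicted keypoint position, which is equivalent to evaluating the sum over j at every x_i.

```python
import math

def aggregate(heatmap, offsets, radius):
    """Fuse heatmap h_k and offset field F_k into an activation map f_k.

    heatmap[y][x] is the disk probability, offsets[y][x] is (dx, dy).
    Assumes G is the bilinear kernel max(0, 1-|dx|) * max(0, 1-|dy|).
    """
    height, width = len(heatmap), len(heatmap[0])
    fused = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            weight = heatmap[y][x] * norm
            if weight == 0.0:
                continue
            dx, dy = offsets[y][x]
            px, py = x + dx, y + dy  # position this pixel votes for
            x0, y0 = math.floor(px), math.floor(py)
            for yi in (y0, y0 + 1):
                for xi in (x0, x0 + 1):
                    if 0 <= xi < width and 0 <= yi < height:
                        g = max(0.0, 1 - abs(px - xi)) * max(0.0, 1 - abs(py - yi))
                        fused[yi][xi] += weight * g
    return fused

# Toy case: perfect heatmap and offsets for a keypoint at (2, 2), R = 1.
size, radius = 5, 1
heatmap = [[1.0 if math.hypot(x - 2, y - 2) <= radius else 0.0
            for x in range(size)] for y in range(size)]
offsets = [[(2 - x, 2 - y) for x in range(size)] for y in range(size)]
fused = aggregate(heatmap, offsets, radius)
```

In this toy case all five pixels inside the disk vote for the same location, so the fused map has a single sharp peak at (2, 2) instead of a flat disk.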

OKS-Based Non Maximum Suppression

Use OKS (object keypoint similarity) instead of box IoU to measure the overlap between two candidate detections when eliminating duplicates.
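A rough sketch of greedy NMS with OKS as the overlap measure. The OKS form below follows the COCO keypoint metric with every keypoint treated as visible, which is a simplification. The function names, the per-keypoint constants `kappas`, the `areas` argument and the threshold value are placeholders, not values from the paper.

```python
import math

def oks(pose_a, pose_b, area, kappas):
    """Object keypoint similarity between two poses given as lists of (x, y).

    Assumes all keypoints are visible; `area` plays the role of the squared
    object scale and `kappas` are per-keypoint falloff constants.
    """
    total = 0.0
    for (xa, ya), (xb, yb), kappa in zip(pose_a, pose_b, kappas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2.0 * area * kappa ** 2))
    return total / len(kappas)

def oks_nms(poses, scores, areas, kappas, threshold):
    """Greedy NMS: keep the best-scoring pose, drop later poses too similar to a kept one."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(oks(poses[i], poses[j], areas[j], kappas) <= threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical two-keypoint poses: the second is a near-duplicate of the first.
poses = [
    [(10, 10), (20, 20)],
    [(11, 10), (20, 21)],
    [(50, 50), (60, 60)],
]
kept = oks_nms(poses, scores=[0.9, 0.8, 0.7], areas=[400, 400, 400],
               kappas=[0.1, 0.1], threshold=0.5)
print(kept)
```

Unlike box IoU, this comparison looks at the keypoints themselves, so two people whose boxes overlap heavily but whose poses differ are less likely to suppress each other.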


Heatmap Loss: sum of logistic losses for each position and keypoint separately

Offset Loss: $L_o(\theta) = \sum_{k=1:K} \sum_{i:\left\Vert l_k - x_i \right\Vert \le R} H(\left\Vert F_k(x_i) - (l_k - x_i) \right\Vert)$

H is the Huber robust loss.

Loss: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$
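To make the two terms concrete, here is a small sketch for a single keypoint (the full loss would sum over all K keypoints). The Huber threshold `delta`, the weights passed as `lambda_h` and `lambda_o`, and all function names are illustrative assumptions, not the paper's settings.

```python
import math

def huber(u, delta=1.0):
    """Huber robust loss: quadratic near zero, linear for large residuals."""
    a = abs(u)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def logistic_loss(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def total_loss(logits, pred_offsets, keypoint, radius, lambda_h, lambda_o):
    """Weighted sum of heatmap and offset losses for one keypoint."""
    kx, ky = keypoint
    l_h = l_o = 0.0
    for y, row in enumerate(logits):
        for x, logit in enumerate(row):
            inside = math.hypot(x - kx, y - ky) <= radius
            # Heatmap term: every position contributes a logistic loss.
            l_h += logistic_loss(logit, 1.0 if inside else 0.0)
            # Offset term: only positions inside the disk are penalized.
            if inside:
                fx, fy = pred_offsets[y][x]
                err = math.hypot(fx - (kx - x), fy - (ky - y))
                l_o += huber(err)
    return lambda_h * l_h + lambda_o * l_o

# Toy 3x3 crop: uninformative logits, one wrong offset at the keypoint pixel.
logits = [[0.0] * 3 for _ in range(3)]
pred_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
pred_offsets[1][1] = (3.0, 4.0)
loss = total_loss(logits, pred_offsets, keypoint=(1, 1), radius=0,
                  lambda_h=1.0, lambda_o=0.5)
```

Restricting the offset term to the disk matches the inner sum in the offset loss: offsets far from the keypoint are never supervised, so only the heatmap decides whether a pixel's vote counts.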

Pose Rescoring

At test time, do not use the confidence of the person detector (which generates the bounding boxes) as the pose score; use the average of the maximum keypoint activations instead:


$$score(\mathcal{I}) = \frac{1}{K} \sum^K_{k=1} \max_{x_i} f_k(x_i)$$
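As a quick illustration of the rescoring rule, the sketch below takes one fused activation map per keypoint and averages their peak values. The name `pose_score` and the toy numbers are invented for the example.

```python
def pose_score(fused_maps):
    """Mean over keypoints of the maximum activation in each fused map.

    fused_maps[k][y][x] holds f_k evaluated at pixel (x, y).
    """
    peaks = [max(max(row) for row in fmap) for fmap in fused_maps]
    return sum(peaks) / len(peaks)

# Two hypothetical keypoints on a 2x2 grid: peaks are 0.9 and 0.5.
maps = [
    [[0.1, 0.9], [0.2, 0.3]],
    [[0.5, 0.4], [0.1, 0.0]],
]
score = pose_score(maps)
print(score)
```

A pose with one weakly localized keypoint is pulled down by that keypoint's low peak, which a single box-level confidence cannot express.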


Author: Texot
Reprint policy: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: Texot.