Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Abstract
- Stages:
- Predict the location and scale of person boxes (Faster-RCNN detector)
- Estimate the keypoints of the person in each box
- Detecting keypoints: ResNet
- To combine the outputs, introduce an aggregation procedure that yields highly localized predictions
- Use keypoint-based NMS instead of cruder box-level NMS
- Using keypoint-based confidence score estimation, instead of box-level scoring
Introduction
- Top-down approach
- Stages:
- Faster-RCNN method on top of a ResNet-101 CNN, as in J. Huang
- ResNet: predict activation heatmaps and offsets for each keypoint, similar to L. Pishchulin and E. Insafutdinov, then combine the predictions using a novel form of heatmap-offset aggregation
- Avoid duplicate pose detections by keypoint-based NMS
- Propose keypoint-based confidence score estimator rather than using Faster-RCNN box scores
J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv:1611.10012, 2016.
L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
Related
The related-work part of the paper is comprehensive and well worth reading
Notice
A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
V. Belagiannis and A. Zisserman. Recurrent human pose estimation. arXiv, 2016.
- Infer part relationships
- Infer pairwise joint locations
G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In ECCV, 2016.
- Inspired by work in sequence-to-sequence
- Predicted sequentially rather than independently
- Parts conditioned on all other parts
Method

Person Box Detection
Faster-RCNN
Person Pose Estimation
Regression:
- Keypoint Disk Heatmap
- Offset Field
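
To make the two regression targets concrete, here is a minimal sketch in plain Python. It assumes (my reading, not something these notes spell out) that the heatmap target is 1 inside a disk of some radius around the keypoint and 0 elsewhere, and that the offset target at each position is the vector from that position to the keypoint. The function name `make_targets` and the radius value are hypothetical.

```python
import math

def make_targets(height, width, keypoint, radius):
    """Build a binary disk heatmap and an offset field for one keypoint.

    keypoint is (x, y) in pixel coordinates. The disk definition and the
    offset convention (position -> keypoint) are assumptions of this sketch.
    """
    kx, ky = keypoint
    heatmap = [[0] * width for _ in range(height)]
    offsets = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = kx - x, ky - y
            # Inside the disk of influence the heatmap target is 1.
            if math.hypot(dx, dy) <= radius:
                heatmap[y][x] = 1
            # The offset points from the grid position to the keypoint.
            offsets[y][x] = (float(dx), float(dy))
    return heatmap, offsets

# Toy usage on a 5x5 crop with the keypoint at the center.
heatmap, offsets = make_targets(5, 5, keypoint=(2, 2), radius=1)
```

In this sketch the offsets are filled in everywhere, but only those inside the disk would matter if the offset loss is restricted to the disk.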

Keypoint Disk Heatmap
Offset Field
Hough voting: each point j in the image crop grid casts a vote with its estimate of the position of every keypoint; the vote is weighted by the probability that j lies in the disk of influence of the corresponding keypoint
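
The voting step can be sketched as follows for a single keypoint. This is an illustration under assumptions, not the paper's implementation: votes are spread bilinearly over the four nearest grid cells, no normalization is applied, and the name `hough_aggregate` is made up for this note.

```python
import math

def hough_aggregate(prob, offsets):
    """Accumulate votes for one keypoint.

    Each position (x, y) votes for (x, y) + offsets[y][x], weighted by
    prob[y][x]. The bilinear spreading of each vote is an assumption.
    """
    height, width = len(prob), len(prob[0])
    votes = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            w = prob[y][x]
            if w == 0.0:
                continue
            dx, dy = offsets[y][x]
            tx, ty = x + dx, y + dy  # voted keypoint position
            x0, y0 = math.floor(tx), math.floor(ty)
            fx, fy = tx - x0, ty - y0
            # Split the vote over the four neighbouring cells.
            for yy, wy in ((y0, 1.0 - fy), (y0 + 1, fy)):
                for xx, wx in ((x0, 1.0 - fx), (x0 + 1, fx)):
                    if 0 <= xx < width and 0 <= yy < height:
                        votes[yy][xx] += w * wx * wy
    return votes

# Toy usage: five positions around (2, 2) all point at (2, 2).
size = 5
prob = [[0.0] * size for _ in range(size)]
offsets = [[(0.0, 0.0)] * size for _ in range(size)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    prob[y][x] = 0.5
    offsets[y][x] = (2.0 - x, 2.0 - y)
votes = hough_aggregate(prob, offsets)
```

In the toy input all the probability mass lands on the single cell the offsets agree on, which is the sense in which aggregation gives a sharper map than the disk heatmap alone.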
OKS-Based Non Maximum Suppression
Use OKS (object keypoint similarity) instead of box IoU to eliminate duplicate person detections
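
A rough sketch of greedy NMS driven by OKS instead of IoU. The OKS form follows the COCO keypoint definition (a Gaussian falloff in keypoint distance, scaled by object area and a per-keypoint constant); which detection's area to use, the constants in `kappas`, and the threshold are assumptions here, and the function names are hypothetical.

```python
import math

def oks(pose_a, pose_b, area, kappas):
    """Object keypoint similarity between two poses with matching keypoints."""
    total = 0.0
    for (xa, ya), (xb, yb), k in zip(pose_a, pose_b, kappas):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2.0 * area * k ** 2))
    return total / len(kappas)

def oks_nms(poses, scores, areas, kappas, threshold):
    """Greedy NMS: keep a pose only if it is not too similar to a kept one."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Compare against every higher-scoring pose that survived so far.
        if all(oks(poses[i], poses[j], areas[j], kappas) < threshold
               for j in keep):
            keep.append(i)
    return keep

# Toy usage: poses 0 and 1 nearly coincide, pose 2 is far away.
poses = [
    [(10.0, 10.0), (20.0, 20.0)],
    [(10.5, 10.0), (20.0, 20.5)],
    [(100.0, 100.0), (120.0, 120.0)],
]
kept = oks_nms(poses, scores=[0.9, 0.8, 0.7], areas=[400.0] * 3,
               kappas=[0.1, 0.1], threshold=0.5)
```

Two detections whose boxes overlap heavily can still differ in pose, and the reverse, which is why a keypoint-level similarity is a finer duplicate test than box IoU.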
Training
Heatmap Loss: sum of logistic losses for each position and keypoint separately
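
As a sketch of what "sum of logistic losses" could look like, the snippet below sums a binary cross-entropy term over every keypoint channel and grid position. It assumes the network outputs raw logits and that targets are 0 or 1; the names `logistic_loss` and `heatmap_loss` are mine.

```python
import math

def logistic_loss(logit, target):
    """Binary cross-entropy on a raw logit, written in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def heatmap_loss(logits, targets):
    """Sum the logistic loss over keypoints and positions.

    logits and targets are nested lists indexed [keypoint][y][x].
    """
    total = 0.0
    for logit_map, target_map in zip(logits, targets):
        for logit_row, target_row in zip(logit_map, target_map):
            for z, t in zip(logit_row, target_row):
                total += logistic_loss(z, t)
    return total

# Toy usage: one keypoint, a 1x2 grid, uninformative logits.
toy_loss = heatmap_loss([[[0.0, 0.0]]], [[[1, 0]]])
```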
Offset Loss:
H is the Huber robust loss
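
For reference, a small sketch of a Huber-penalized offset loss. The standard Huber function is quadratic near zero and linear beyond a threshold `delta`; applying it to the norm of the offset error, summing only over positions flagged by a disk mask, and the default `delta=1.0` are all assumptions of this sketch rather than details recorded in these notes.

```python
import math

def huber(u, delta=1.0):
    """Huber loss: quadratic for |u| <= delta, linear beyond it."""
    a = abs(u)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def offset_loss(pred, target, mask, delta=1.0):
    """Sum Huber penalties on offset errors at masked positions.

    pred and target are grids of (dx, dy) tuples; mask is a grid of 0/1.
    """
    total = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, mask):
        for (px, py), (tx, ty), m in zip(pred_row, target_row, mask_row):
            if m:
                total += huber(math.hypot(px - tx, py - ty), delta)
    return total

# Toy usage: one masked position with error norm 5, one ignored position.
toy_offset_loss = offset_loss(
    pred=[[(3.0, 4.0), (9.0, 9.0)]],
    target=[[(0.0, 0.0), (0.0, 0.0)]],
    mask=[[1, 0]],
)
```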
Loss:
Pose Rescoring
At test time, the confidence of the person detector (which generates the bounding boxes) is not used as the pose score
Instead, use the keypoint-based confidence score estimate
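
The exact scoring formula is not recorded in these notes. One plausible form, shown purely as an assumption, averages over keypoints the peak value of each keypoint's aggregated heatmap; `pose_score` is a hypothetical name.

```python
def pose_score(keypoint_maps):
    """Average, over keypoints, of the maximum value in each keypoint's map.

    keypoint_maps is a nested list indexed [keypoint][y][x]. This scoring
    rule is an assumed illustration of a keypoint-based confidence.
    """
    maxima = [max(max(row) for row in fmap) for fmap in keypoint_maps]
    return sum(maxima) / len(maxima)

# Toy usage: two keypoints with peak responses 0.5 and 0.25.
toy_score = pose_score([
    [[0.1, 0.5], [0.0, 0.2]],
    [[0.25, 0.0], [0.1, 0.1]],
])
```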
Result
