Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

## Abstract

- Stages:
  - Predict the location and scale of person boxes (Faster RCNN detector)
  - Estimate keypoints inside each box
- Detecting keypoints: ResNet
- To combine outputs, __introduce aggregation procedure to obtain highly localized predictions__
- Using keypoint-based NMS instead of cruder box-level NMS
- Using keypoint-based confidence score estimation, instead of box-level scoring

## Introduction

- Top-down approach
- Stages:
  - Faster-RCNN method on top of a ResNet-101 CNN, as in *J. Huang*
  - ResNet: predict activation heatmaps and offsets for each keypoint, similar to *L. Pishchulin* and *E. Insafutdinov*, combining their predictions using a novel form of heatmap-offset aggregation
- Avoid duplicate pose detections by keypoint-based NMS
- Propose keypoint-based confidence score estimator rather than using Faster-RCNN box scores

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv:1611.10012, 2016.

L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.

E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.

## Related

**This part of the paper is comprehensive, great**

Notable works:

A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.

V. Belagiannis and A. Zisserman. Recurrent human pose estimation. In arXiv, 2016.

- Infer part relationships
- Infer **pairwise** joint locations

G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In ECCV, 2016.

- Inspired by work in sequence-to-sequence
- Predicted sequentially rather than independently
- Parts conditioned on all other parts

## Method

### Person Box Detection

Faster-RCNN

### Person Pose Estimation

Regression:

- Keypoint Disk Heatmap
- Offset Field

**Keypoint Disk Heatmap**

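$h_k(x_i) = 1\textit{ if }\left\Vert x_i - l_k \right\Vert \le R\textit{, and } 0 \textit{ otherwise}$

Here $l_k$ is the location of keypoint $k$, $x_i$ is a position in the image crop, and $R$ is the disk radius.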
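To make the disk target concrete, here is a minimal NumPy sketch of the ground-truth heatmap for one keypoint. The function name `disk_heatmap`, the `(x, y)` coordinate convention, and the 9x9 grid are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def disk_heatmap(height, width, keypoint_xy, radius):
    """Binary (height, width) map: 1 within `radius` of the keypoint, else 0."""
    ys, xs = np.mgrid[0:height, 0:width]
    # Euclidean distance from every grid position x_i to the keypoint l_k.
    dist = np.hypot(xs - keypoint_xy[0], ys - keypoint_xy[1])
    return (dist <= radius).astype(np.float32)

# Hypothetical example: a 9x9 crop with the keypoint at its centre.
heatmap = disk_heatmap(height=9, width=9, keypoint_xy=(4, 4), radius=2)
```

The network is then trained to classify each position as inside or outside this disk, which is an easier target than a single-pixel peak.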

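**Offset Field**

$F_k(x_i) = l_k - x_i$

The 2-D vector from position $x_i$ to keypoint $k$. It is regressed only for positions inside the disk.

**Heatmap-Offset Aggregation**

$f_k(x_i) = \sum_j \frac{1}{\pi R^2} G(x_j + F_k(x_j) - x_i) h_k(x_j)$

where $G$ is the bilinear interpolation kernel.

Hough voting: each point $j$ in the image crop grid casts a vote with its estimate for the position of every keypoint, with the vote being weighted by the probability that it is in the disk of influence of the corresponding keypoint.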
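The voting step can be sketched in NumPy as below. This is an illustrative reading of the formula for $f_k$, assuming $G$ is a bilinear interpolation kernel (so each vote is split over the four nearest grid points) and that offsets are stored as `(dx, dy)`. The name `aggregate_votes` and the toy inputs are hypothetical.

```python
import numpy as np

def aggregate_votes(prob, offsets, radius):
    """prob: (H, W) disk probabilities h_k; offsets: (H, W, 2) as (dx, dy).
    Returns the aggregated map f_k with shape (H, W)."""
    height, width = prob.shape
    ys, xs = np.mgrid[0:height, 0:width]
    # Position each pixel votes for: x_j + F_k(x_j).
    vx = (xs + offsets[..., 0]).ravel()
    vy = (ys + offsets[..., 1]).ravel()
    weight = prob.ravel() / (np.pi * radius ** 2)

    out = np.zeros((height, width), dtype=np.float64)
    x0 = np.floor(vx).astype(int)
    y0 = np.floor(vy).astype(int)
    for dx in (0, 1):
        for dy in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Bilinear kernel evaluated at this neighbouring grid point.
            g = (1 - np.abs(vx - xi)) * (1 - np.abs(vy - yi))
            inside = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
            # np.add.at accumulates correctly when many votes hit one cell.
            np.add.at(out, (yi[inside], xi[inside]), (weight * g)[inside])
    return out

# Toy check with ideal predictions: every pixel in the disk points at (4, 4).
ys, xs = np.mgrid[0:9, 0:9]
ideal_offsets = np.stack([4.0 - xs, 4.0 - ys], axis=-1)
ideal_prob = (np.hypot(xs - 4, ys - 4) <= 2).astype(float)
votes = aggregate_votes(ideal_prob, ideal_offsets, radius=2)
```

With ideal inputs all the mass lands on the keypoint location, which is why the aggregated map is much more sharply localized than the disk heatmap alone.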

#### OKS-Based Non Maximum Suppression

Measure the overlap between two candidate pose detections with object keypoint similarity (OKS) instead of box IoU, and use it to eliminate duplicate detections.
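A rough sketch of greedy NMS driven by OKS follows. It uses a simplified OKS in the style of the COCO keypoint metric, assuming all keypoints are visible. The per-keypoint constants `kappas`, the `threshold`, and the toy poses are placeholder values chosen for illustration, not the settings used in the paper.

```python
import numpy as np

def oks(pose_a, pose_b, area, kappas):
    """pose_*: (K, 2) keypoints; area: object scale squared; kappas: (K,) constants."""
    d2 = np.sum((pose_a - pose_b) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * area * kappas ** 2))))

def oks_nms(poses, scores, areas, kappas, threshold):
    """Visit poses in descending score order and drop near-duplicates of kept ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        if all(oks(poses[idx], poses[k], areas[k], kappas) < threshold for k in keep):
            keep.append(int(idx))
    return keep

# Toy data: pose 1 is a slightly shifted copy of pose 0, pose 2 is far away.
base = np.array([[10.0, 10.0], [20.0, 20.0]])
poses = np.stack([base, base + 0.5, base + 100.0])
scores = np.array([0.9, 0.8, 0.7])
areas = np.array([400.0, 400.0, 400.0])
kappas = np.array([0.1, 0.1])
kept = oks_nms(poses, scores, areas, kappas, threshold=0.5)
```

Because the similarity is computed on keypoints, two people whose boxes overlap heavily but whose poses differ can both survive, which box-level IoU would not guarantee.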

#### Training

Heatmap Loss: sum of logistic losses for each position and keypoint separately

Offset Loss: $L_o(\theta) = \sum_{k=1:K} \sum_{i:\left\Vert l_k - x_i \right\Vert \le R} H(\left\Vert F_k(x_i) - (l_k - x_i) \right\Vert)$

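where $H$ is the Huber robust loss. The offset error is penalized only at positions inside the disk of radius $R$ around each keypoint.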

Loss: $L(\theta) = \lambda_h L_h(\theta) + \lambda_o L_o(\theta)$
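The combined loss can be written out in NumPy as a sketch. The Huber threshold `delta`, the tensor layouts, and the function names are assumptions for illustration, and the weights `lambda_h` and `lambda_o` are left as parameters rather than fixed to any particular values.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on non-negative residual norms r (delta is an assumed threshold)."""
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

def pose_loss(logits, pred_offsets, keypoints, radius, lambda_h, lambda_o):
    """logits: (K, H, W); pred_offsets: (K, H, W, 2) as (dx, dy); keypoints: (K, 2) as (x, y)."""
    num_kp, height, width = logits.shape
    ys, xs = np.mgrid[0:height, 0:width]
    loss_h, loss_o = 0.0, 0.0
    for k in range(num_kp):
        # Ground-truth offset l_k - x_i and disk membership h_k(x_i).
        target = np.stack([keypoints[k, 0] - xs, keypoints[k, 1] - ys], axis=-1)
        in_disk = np.linalg.norm(target, axis=-1) <= radius
        # Logistic loss per position, in a numerically stable form.
        z = logits[k]
        loss_h += np.sum(np.maximum(z, 0) - z * in_disk + np.log1p(np.exp(-np.abs(z))))
        # Huber loss on the offset error, summed only inside the disk.
        err = np.linalg.norm(pred_offsets[k] - target, axis=-1)
        loss_o += np.sum(huber(err)[in_disk])
    return lambda_h * loss_h + lambda_o * loss_o

# Toy inputs: one keypoint at the centre of a 5x5 crop, uninformative logits.
ys, xs = np.mgrid[0:5, 0:5]
toy_logits = np.zeros((1, 5, 5))
toy_keypoints = np.array([[2.0, 2.0]])
perfect_offsets = np.stack([2.0 - xs, 2.0 - ys], axis=-1)[None]
loss_value = pose_loss(toy_logits, perfect_offsets, toy_keypoints, 1.0, 1.0, 1.0)
```

Restricting the offset term to the disk means the regression branch never has to model long-range displacements, only small corrections near each keypoint.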

#### Pose Rescoring

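At test time, do not use the confidence of the person detector (which generates the bounding boxes) as the pose score.

Instead, average the maximum of each keypoint's aggregated map $f_k$ over all $K$ keypoints: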

$score(\mathcal{I}) = \frac{1}{K} \sum^K_{k=1} \max_{x_i} f_k(x_i)$
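In code the score is a per-keypoint spatial maximum followed by a mean. A minimal sketch, assuming the aggregated maps are stacked as a `(K, H, W)` array; the name `pose_score` and the toy values are hypothetical.

```python
import numpy as np

def pose_score(aggregated):
    """aggregated: (K, H, W) maps f_k. Returns the mean over keypoints of each map's maximum."""
    per_keypoint_max = aggregated.reshape(aggregated.shape[0], -1).max(axis=1)
    return float(np.mean(per_keypoint_max))

# Toy maps for K = 2 keypoints on a 3x3 grid.
maps = np.zeros((2, 3, 3))
maps[0, 1, 1] = 0.8
maps[1, 0, 2] = 0.4
score = pose_score(maps)
```

A pose whose keypoints are all confidently localized scores high even when the box score is mediocre, which matches the motivation for replacing box-level scoring.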