Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Title: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy

# Pose

## Detection

Hough voting

$h_k(x) = \frac{1}{\pi R^2} \sum_{i=1:N} p_k(x_i) B(x_i + S_k(x_i) -x)$

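Here p_k(x_i) is the heatmap probability that pixel x_i lies within a disk of radius R of a keypoint of type k, and the sum runs over all N image positions.

B is the bilinear interpolation kernel.

S_k is the short-range offset field, which points from x_i to the k-th keypoint of the closest person instance.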
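The voting formula above can be sketched in plain Python. This is a minimal, unvectorized illustration and not the authors' implementation; the function name `hough_score_map` and the nested-list data layout are assumptions made for readability.

```python
import math

def hough_score_map(prob, offsets, radius):
    """Accumulate Hough votes for one keypoint type k.

    prob[y][x]    -- heatmap probability p_k at pixel (x, y)
    offsets[y][x] -- short-range offset S_k at that pixel, as (dx, dy)
    radius        -- disk radius R used in the 1 / (pi R^2) normalization
    """
    height, width = len(prob), len(prob[0])
    scores = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            dx, dy = offsets[y][x]
            tx, ty = x + dx, y + dy  # voted position x_i + S_k(x_i)
            x0, y0 = math.floor(tx), math.floor(ty)
            fx, fy = tx - x0, ty - y0
            # Bilinear kernel B: split the vote over the 4 nearest pixels.
            corners = (
                (x0, y0, (1 - fx) * (1 - fy)),
                (x0 + 1, y0, fx * (1 - fy)),
                (x0, y0 + 1, (1 - fx) * fy),
                (x0 + 1, y0 + 1, fx * fy),
            )
            for cx, cy, w in corners:
                if 0 <= cx < width and 0 <= cy < height:
                    scores[cy][cx] += norm * prob[y][x] * w
    return scores
```

Votes that land outside the image are simply dropped here, which is one possible boundary policy among several.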

## Grouping

We add to our network a separate pairwise mid-range 2-D offset field output M_{k,l}(x) designed to connect pairs of keypoints. We compute 2(K − 1) such offset fields, one for each directed edge connecting pairs (k, l) of keypoints that are adjacent to each other in a tree-structured kinematic graph of the person.

**Mid-range pairwise offsets**

Target:

$M_{k,l}(x) = (y_{j,l} - x) [ x \in \mathcal{D}_R (y_{j,k}) ]$
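As a small illustration of this target, the sketch below evaluates it for one pixel and one person instance. The function name and the tuple-based point representation are hypothetical, and returning a zero vector outside the disk simply mirrors the indicator bracket in the formula.

```python
import math

def mid_range_target(x, y_k, y_l, radius):
    """Target M_{k,l}(x) for one person: the vector from x to keypoint l,
    active only when x lies within `radius` of keypoint k."""
    inside = math.hypot(x[0] - y_k[0], x[1] - y_k[1]) <= radius
    if not inside:
        return (0.0, 0.0)  # indicator [x in D_R(y_{j,k})] is 0
    return (y_l[0] - x[0], y_l[1] - x[1])
```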

**Recurrent offset refinement**

$M_{k,l} (x) \leftarrow x' + S_l(x'), \text{where } x' = M_{k,l} (x)$
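One way to read the update above is sketched below, under the assumption that x' denotes the absolute position reached by following the mid-range offset from x. The callables `mid_offset` and `short_offset` are hypothetical stand-ins for sampling the predicted fields (in practice this would be an interpolated lookup), and the number of steps is left as a parameter.

```python
def refine_offset(x, mid_offset, short_offset, steps=2):
    """Refine the displacement from x toward keypoint l.

    mid_offset(pos)   -> (dx, dy): mid-range offset M_{k,l} at pos
    short_offset(pos) -> (dx, dy): short-range offset S_l at pos
    """
    dx, dy = mid_offset(x)
    tx, ty = x[0] + dx, x[1] + dy  # x': first guess for keypoint l
    for _ in range(steps):
        sx, sy = short_offset((tx, ty))
        tx, ty = tx + sx, ty + sy  # x' + S_l(x')
    return (tx - x[0], ty - x[1])  # refined offset, relative to x
```

The intuition is that the short-range field is more accurate near the target keypoint, so each step corrects the coarser mid-range guess.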

**Fast greedy decoding**

- create a priority queue, shared across all K keypoint types, into which we insert the position x_i and keypoint type k of all local maxima in the Hough score maps h_k(x) (these serve as candidate seeds for starting a detection instance)
- pop elements out in descending score order
- if the position x_i of the current candidate detection seed of type k is within a disk D_r(y_{j',k}) of the corresponding keypoint of a previously detected person instance j', reject it
- otherwise, start a new detection instance j with the k-th keypoint at position y_{j,k} = x_i serving as seed
- follow the mid-range displacement vectors along the edges of the kinematic graph to connect pairs, setting y_{j,l} = y_{j,k} + M_{k,l}(y_{j,k})
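The steps above can be sketched as follows. This is an illustrative reading of the list, not the reference implementation: the seed tuple layout, the `mid_offset` callable, and the dictionary representation of an instance are all assumptions.

```python
import heapq
import math

def greedy_decode(seeds, edges, mid_offset, nms_radius):
    """seeds: list of (score, k, (x, y)) local maxima of the Hough maps.
    edges: directed (k, l) pairs of the kinematic tree, both directions.
    mid_offset(k, l, pos) -> (dx, dy): mid-range offset lookup at pos.
    Returns a list of instances, each a dict {keypoint type: position}.
    """
    heap = [(-score, k, pos) for score, k, pos in seeds]  # max-heap via negation
    heapq.heapify(heap)
    instances = []
    while heap:
        _, k, pos = heapq.heappop(heap)
        # Reject seeds close to the same keypoint of an earlier instance.
        if any(k in inst and
               math.hypot(pos[0] - inst[k][0], pos[1] - inst[k][1]) <= nms_radius
               for inst in instances):
            continue
        inst = {k: pos}
        # Walk outward from the seed until no edge adds a new keypoint.
        grew = True
        while grew:
            grew = False
            for a, b in edges:
                if a in inst and b not in inst:
                    dx, dy = mid_offset(a, b, inst[a])
                    inst[b] = (inst[a][0] + dx, inst[a][1] + dy)
                    grew = True
        instances.append(inst)
    return instances
```

Because the queue is shared across keypoint types, any confidently detected keypoint can seed an instance, so no single keypoint type is privileged as the starting point.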

## Scoring

### Keypoint score

Expected-OKS

$s_{j,k} = \mathrm{E}\left\{ {OKS}_{j,k} \right\} = p_k (y_{j,k}) \int_{x \in \mathcal{D}_R (y_{j,k})} \hat{h}_k (x) \exp \left( - \frac{(x-y_{j,k})^2}{2 \lambda_j^2 \kappa_k^2} \right) dx$
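A discrete approximation of this integral might look like the sketch below. It assumes that the normalized Hough score is the raw Hough score divided by its sum over the disk, and it replaces the integral with a sum over pixels; the function name and argument layout are hypothetical.

```python
import math

def expected_oks(p_k, hough, kp, radius, lam, kappa):
    """Approximate s_{j,k} for one keypoint.

    p_k    -- heatmap probability at the keypoint position
    hough  -- hough[y][x], raw Hough scores h_k
    kp     -- keypoint position y_{j,k} as (x, y)
    lam    -- instance scale lambda_j
    kappa  -- per-keypoint OKS constant kappa_k
    """
    cx, cy = kp
    total, weighted = 0.0, 0.0
    for y in range(len(hough)):
        for x in range(len(hough[0])):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius ** 2:  # restrict to the disk D_R(y_{j,k})
                total += hough[y][x]
                weighted += hough[y][x] * math.exp(-d2 / (2 * lam ** 2 * kappa ** 2))
    # Dividing by `total` normalizes the Hough scores inside the disk.
    return p_k * weighted / total if total > 0 else 0.0
```

When all Hough mass sits exactly on the keypoint the score reduces to p_k, and it decreases as the votes spread out.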

### Instance-level score

Soft-NMS

We use as the final instance-level score the average of the scores of the keypoints that have not already been claimed by higher-scoring instances:

$s_j = (1/K) \sum_{k=1:K} s_{j,k} \left[ \lVert y_{j,k} - y_{j',k} \rVert > r,\text{ for every } j'<j \right]$
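The instance-level score can be sketched as below, assuming the instances are already sorted from highest to lowest priority and that each one lists all K keypoints. The data layout is hypothetical.

```python
import math

def instance_scores(instances, nms_radius):
    """instances: list in decreasing priority order; each instance is a
    list of (position, keypoint_score) pairs for k = 0..K-1.
    Returns one soft-NMS score per instance."""
    scores = []
    for j, inst in enumerate(instances):
        total = 0.0
        for k, (pos, s) in enumerate(inst):
            # A keypoint is claimed if an earlier instance has its k-th
            # keypoint within nms_radius of this one.
            claimed = any(
                math.hypot(pos[0] - prev[k][0][0], pos[1] - prev[k][0][1]) <= nms_radius
                for prev in instances[:j])
            if not claimed:
                total += s
        scores.append(total / len(inst))  # the 1/K factor
    return scores
```

Note that a duplicate detection is not removed outright; it only loses the contribution of the keypoints it shares with stronger instances.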

# Instance-level person segmentation

Given keypoint-level person instance detections, identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping).

- Semantic person segmentation in standard fully-convolutional fashion
- Associating segments with instances via geometric embeddings

**Long-range offset**

$L_k(x) = y_{j,k} - x$

**Refinement**

$L_k (x) \leftarrow x' + L_k (x'), x'=L_k(x) \text{ and } L_k(x) \leftarrow x' + S_k (x'), x'=L_k(x)$

**Geometric embedding**

$G(x) = (G_k(x))_{k=1,\dots,K}, G_k(x) = x+L_k(x)$

**Embedding distance**

To decide if the person pixel x_i belongs to the j-th person instance, we compute the embedding distance metric

$D_{i,j} = \frac{1}{\sum_k p_k (y_{j,k})} \sum_{k=1}^K p_k(y_{j,k}) \frac{1}{\lambda_j} \lVert G_k(x_i) - y_{j,k} \rVert$

Here $\lambda_j$ is set to the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance.
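A minimal sketch of the embedding distance for one pixel and one instance follows. The function name and list-based inputs are assumptions, and the sketch does not guard against a degenerate bounding box of zero area.

```python
import math

def embedding_distance(embeddings, keypoints, kp_probs):
    """embeddings -- G_k(x_i) for k = 0..K-1, each an (x, y) position
    keypoints  -- detected y_{j,k} for the same k, each an (x, y) position
    kp_probs   -- p_k(y_{j,k}), used as per-keypoint weights
    """
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    # lambda_j: square root of the area of the tight keypoint bounding box.
    lam = math.sqrt((max(xs) - min(xs)) * (max(ys) - min(ys)))
    weighted = sum(
        p * math.hypot(g[0] - y[0], g[1] - y[1])
        for g, y, p in zip(embeddings, keypoints, kp_probs))
    return weighted / (lam * sum(kp_probs))
```

Dividing by the instance scale makes the distance comparable between large and small people, so a single threshold can be used for all instances.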

With $N_S$ person pixels and $M$ person instances, this association has complexity $O(N_S \cdot M)$, instead of the $O(N_S^2)$ of standard embedding-based segmentation techniques, which, at least in principle, require computing embedding vector distances for all pixel pairs.

**Procedure**

- find all positions x_i marked as person in the semantic segmentation map, i.e. those pixels that have semantic segmentation probability p_S(x_i) >= 0.5
- associate each person pixel x_i with every detected person instance j for which the embedding distance metric satisfies D_{i,j} <= t
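The two steps above can be sketched as a double loop over person pixels and instances. The `distance` callable stands in for the embedding distance D_{i,j} and is hypothetical, as are the function name and the set-based mask representation.

```python
def assign_pixels(seg_prob, distance, num_instances, threshold, seg_threshold=0.5):
    """seg_prob[y][x]     -- semantic person probability p_S
    distance(x, y, j)  -- embedding distance D_{i,j} of pixel (x, y) to instance j
    threshold          -- the distance threshold t
    Returns one set of (x, y) pixels per instance."""
    masks = [set() for _ in range(num_instances)]
    for y, row in enumerate(seg_prob):
        for x, p in enumerate(row):
            if p < seg_threshold:
                continue  # not a person pixel
            # One distance evaluation per (person pixel, instance) pair.
            for j in range(num_instances):
                if distance(x, y, j) <= threshold:
                    masks[j].add((x, y))
    return masks
```

A pixel can end up in more than one mask, since it is associated with every instance that passes the threshold, and a person pixel far from all instances is left unassigned.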