Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.

Title: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model

George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy

# Pose

## Detection Hough voting

$h_k(x) = \frac{1}{\pi R^2} \sum_{i=1:N} p_k(x_i) B(x_i + S_k(x_i) -x)$

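B is the bilinear interpolation kernel.

S_k is the short-range offset field for keypoint type k, p_k is the keypoint heatmap probability, x_i ranges over the N image positions, and R is the radius of the keypoint disk.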
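The voting step above can be sketched in plain Python for a single keypoint type. This is only an illustration under assumptions: the function name `hough_score_map`, the list-of-lists layout, and the handling of out-of-bounds votes (they are dropped) are hypothetical choices, not details taken from the paper.

```python
import math

def hough_score_map(prob, offsets, radius):
    """Accumulate Hough votes h_k for one keypoint type.

    prob[y][x]    : heatmap probability p_k at pixel (x, y)
    offsets[y][x] : short-range offset S_k at that pixel, as (dx, dy)
    radius        : disk radius R used in the 1 / (pi R^2) normalization
    """
    height, width = len(prob), len(prob[0])
    hough = [[0.0] * width for _ in range(height)]
    norm = 1.0 / (math.pi * radius ** 2)
    for y in range(height):
        for x in range(width):
            dx, dy = offsets[y][x]
            tx, ty = x + dx, y + dy              # voted position x_i + S_k(x_i)
            x0, y0 = math.floor(tx), math.floor(ty)
            fx, fy = tx - x0, ty - y0
            # Bilinear kernel B: split the vote over the 4 nearest pixels.
            for cx, cy, w in ((x0, y0, (1 - fx) * (1 - fy)),
                              (x0 + 1, y0, fx * (1 - fy)),
                              (x0, y0 + 1, (1 - fx) * fy),
                              (x0 + 1, y0 + 1, fx * fy)):
                # Assumption: votes landing outside the map are discarded.
                if 0 <= cx < width and 0 <= cy < height and w > 0:
                    hough[cy][cx] += norm * prob[y][x] * w
    return hough
```

Each pixel casts one vote weighted by its heatmap probability, so pixels inside the keypoint disk with accurate offsets pile their votes onto the same location.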

## Grouping

We add to our network a separate pairwise mid-range 2-D offset field output M_{k,l}(x) designed to connect pairs of keypoints. We compute 2(K − 1) such offset fields, one for each directed edge connecting pairs (k, l) of keypoints that are adjacent to each other in a tree-structured kinematic graph of the person.

Mid-range pairwise offsets

Target:

$M_{k,l}(x) = (y_{j,l} - x) [ x \in \mathcal{D}_R (y_{j,k}) ]$

Recurrent offset refinement

$M_{k,l} (x) \leftarrow x' + S_l(x'), \text{where } x' = M_{k,l} (x)$
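A minimal sketch of the refinement step follows. It rests on one interpretation that should be read as an assumption: x' is taken to be the absolute position where the mid-range offset lands (x plus the offset), and the result is converted back into an offset from x. The lookups `mid_offset` and `short_offset` and the `steps` parameter are hypothetical names for illustration.

```python
def refine_mid_range(x, mid_offset, short_offset, steps=2):
    """Refine the mid-range offset at position x using short-range offsets.

    mid_offset(p)   -> (dx, dy): hypothetical lookup of M_{k,l} at point p
    short_offset(p) -> (dx, dy): hypothetical lookup of S_l at point p
    Returns the refined offset from x toward keypoint l.
    """
    dx, dy = mid_offset(x)
    for _ in range(steps):
        # x' is where the current offset lands (assumed interpretation).
        landing = (x[0] + dx, x[1] + dy)
        sx, sy = short_offset(landing)
        # Correct the landing point with the short-range offset found there,
        # then express the result again as an offset from x.
        dx = landing[0] + sx - x[0]
        dy = landing[1] + sy - x[1]
    return dx, dy
```

The idea is that the mid-range offset only has to land somewhere near keypoint l; the short-range offset at the landing point then supplies the finer correction.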

Fast greedy decoding

1. Create a priority queue, shared across all K keypoint types, and insert the position x_i and keypoint type k of every local maximum in the Hough score maps h_k(x). These serve as candidate seeds for starting a detection instance.
2. Pop elements out of the queue in descending score order.
3. If the position x_i of the current candidate seed of type k lies within a disk D_r(y_{j',k}) around the corresponding keypoint of a previously detected person instance j', reject it.
4. Otherwise, start a new detection instance j with the k-th keypoint at position y_{j,k} = x_i serving as the seed.
5. Follow the mid-range displacement vectors along the edges of the kinematic tree to place the remaining keypoints, setting y_{j,l} = y_{j,k} + M_{k,l}(y_{j,k}).
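The five steps above can be sketched with Python's `heapq`. The data layout (`seeds` as tuples, `edges` as an adjacency dict) and the `mid_offset` lookup are assumptions made for illustration; they are not taken from the paper.

```python
import heapq
import math

def greedy_decode(seeds, edges, mid_offset, nms_radius):
    """Greedy decoding sketch.

    seeds      : list of (score, keypoint_type, (x, y)) local maxima
    edges      : dict keypoint_type -> adjacent keypoint types (tree)
    mid_offset : hypothetical lookup, mid_offset(k, l, pos) -> (dx, dy)
    nms_radius : disk radius r used to reject duplicate seeds
    Returns a list of instances, each a dict keypoint_type -> (x, y).
    """
    # heapq is a min-heap, so negate scores to pop the highest first.
    queue = [(-score, k, pos) for score, k, pos in seeds]
    heapq.heapify(queue)
    instances = []
    while queue:
        _, k, pos = heapq.heappop(queue)
        # Step 3: reject seeds near keypoint k of an existing instance.
        if any(k in inst and math.dist(inst[k], pos) <= nms_radius
               for inst in instances):
            continue
        inst = {k: pos}                      # step 4: start a new instance
        stack = [k]
        while stack:                         # step 5: walk the tree outward
            src = stack.pop()
            for dst in edges.get(src, ()):
                if dst not in inst:
                    dx, dy = mid_offset(src, dst, inst[src])
                    inst[dst] = (inst[src][0] + dx, inst[src][1] + dy)
                    stack.append(dst)
        instances.append(inst)
    return instances
```

Because every seed type shares one queue, the most confident keypoint of a person, whichever type it is, starts that person's instance, and weaker maxima of the same person are rejected in step 3.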

## Keypoint score

Expected-OKS

$s_{j,k} = \mathrm{E}\left\{ {OKS}_{j,k} \right\} = p_k (y_{j,k}) \int_{x \in \mathcal{D}_R (y_{j,k})} \hat{h}_k (x) \exp \left( - \frac{(x-y_{j,k})^2}{2 \lambda_j^2 \kappa_k^2} \right) dx$
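A discrete approximation of this integral might look like the sketch below. It assumes that \hat{h}_k denotes the Hough scores normalized to sum to one inside the disk D_R(y_{j,k}); the notes do not define it, so treat that as an assumption. All argument names are hypothetical.

```python
import math

def expected_oks(prob_at_kp, hough, center, radius, scale, kappa):
    """Discrete Expected-OKS sketch for one keypoint.

    prob_at_kp : p_k(y_{j,k}), heatmap probability at the keypoint
    hough      : hough[y][x], Hough score map h_k
    center     : keypoint position y_{j,k} as (x, y)
    radius     : disk radius R
    scale      : instance scale lambda_j
    kappa      : per-keypoint OKS constant kappa_k
    """
    cx, cy = center
    in_disk = []
    for y in range(len(hough)):
        for x in range(len(hough[0])):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius ** 2:
                in_disk.append((hough[y][x], d2))
    total = sum(h for h, _ in in_disk)
    if total == 0:
        return 0.0  # no votes inside the disk
    # Assumption: normalize Hough scores inside the disk to sum to 1.
    acc = sum((h / total) * math.exp(-d2 / (2 * scale ** 2 * kappa ** 2))
              for h, d2 in in_disk)
    return prob_at_kp * acc
```

When all the votes sit exactly on the keypoint, the sum reduces to 1 and the score equals the heatmap probability; spread-out votes lower it.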

## Instance-level score

soft-NMS

We use as the final instance-level score the sum of the scores of the keypoints that have not already been claimed by higher-scoring instances, normalized by the total number of keypoints K.

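$s_j = (1/K) \sum_{k=1:K} s_{j,k} \left[ \lVert y_{j,k} - y_{j',k} \rVert > r,\text{ for every } j' < j \right]$

Here instances are indexed in descending score order, so j' < j ranges over the higher-scoring instances.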
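As a small sketch of this score, assuming the higher-scoring instances are passed in explicitly as lists of keypoint positions (the function and argument names are hypothetical):

```python
import math

def instance_score(keypoint_scores, keypoints, higher_instances, r):
    """Instance-level score sketch.

    keypoint_scores  : s_{j,k} for k = 0..K-1
    keypoints        : y_{j,k} as (x, y) for k = 0..K-1
    higher_instances : keypoint lists of higher-scoring instances j'
    r                : NMS radius
    """
    total = 0.0
    for k, (score, pos) in enumerate(zip(keypoint_scores, keypoints)):
        # A keypoint is claimed if a higher-scoring instance has its
        # k-th keypoint within distance r.
        claimed = any(math.dist(pos, prev[k]) <= r
                      for prev in higher_instances)
        if not claimed:
            total += score
    return total / len(keypoint_scores)   # normalize by K
```

An instance that duplicates a higher-scoring one loses most of its keypoint scores this way, so it ends up with a low score without being removed outright.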

# Instance-level person segmentation

Given keypoint-level person instance detections, identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping)

1. Semantic person segmentation in standard fully-convolutional fashion
2. Associating segments with instances via geometric embeddings

Long-range offset

$L_k(x) = y_{j,k} - x$

Refinement

$L_k (x) \leftarrow x' + L_k (x'), \; x'=L_k(x) \text{ and } L_k(x) \leftarrow x' + S_k (x'), \; x'=L_k(x)$

Geometric embedding

$G(x) = (G_k(x))_{k=1,\dots,K}, G_k(x) = x+L_k(x)$

Embedding distance

To decide if the person pixel x_i belongs to the j-th person instance, we compute the embedding distance metric

$D_{i,j} = \frac{1}{\sum_k p_k (y_{j,k})} \sum_{k=1}^K p_k(y_{j,k}) \frac{1}{\lambda_j} \lVert G_k(x_i) - y_{j,k} \rVert$

We set $\lambda_j$ equal to the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance.
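The distance can be sketched as follows for one pixel and one instance. The function name and the tuple-based layout are assumptions for illustration, and the sketch does not guard against a degenerate bounding box with zero area.

```python
import math

def embedding_distance(embedding, keypoints, keypoint_probs):
    """Embedding distance D_{i,j} sketch.

    embedding      : G_k(x_i) = x_i + L_k(x_i) as (x, y), for each k
    keypoints      : detected keypoints y_{j,k} as (x, y), for each k
    keypoint_probs : p_k(y_{j,k}), for each k
    """
    xs = [p[0] for p in keypoints]
    ys = [p[1] for p in keypoints]
    # lambda_j: square root of the tight keypoint bounding-box area.
    # Note: this is zero for collinear keypoints; not handled here.
    scale = math.sqrt((max(xs) - min(xs)) * (max(ys) - min(ys)))
    weighted = sum(p * math.dist(g, y)
                   for g, y, p in zip(embedding, keypoints, keypoint_probs))
    return weighted / (scale * sum(keypoint_probs))
```

Dividing by the instance scale makes the distance comparable between large and small people, and weighting by p_k lets confidently detected keypoints dominate.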

With N_S person pixels and M person instances, this assignment has complexity O(N_S * M), instead of the O(N_S * N_S) of standard embedding-based segmentation techniques, which, at least in principle, require computing embedding vector distances for all pixel pairs.

Procedure

1. Find all positions x_i marked as person in the semantic segmentation map, i.e. those pixels that have semantic segmentation probability p_S(x_i) >= 0.5.
2. Associate each person pixel x_i with every detected person instance j for which the embedding distance metric satisfies D_{i,j} <= t.
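The two steps above can be sketched as a double loop over person pixels and instances. The precomputed `distances` table and the function name are assumptions made for illustration; the threshold t is left as a parameter.

```python
def assign_pixels(seg_prob, distances, threshold, min_prob=0.5):
    """Pixel-to-instance assignment sketch.

    seg_prob  : seg_prob[i] is p_S(x_i) for pixel i
    distances : distances[i][j] is the embedding distance D_{i,j}
    threshold : distance threshold t
    Returns one list of pixel indices per instance.
    """
    num_instances = len(distances[0]) if distances else 0
    masks = [[] for _ in range(num_instances)]
    for i, p in enumerate(seg_prob):
        if p < min_prob:
            continue                      # step 1: keep person pixels only
        for j, d in enumerate(distances[i]):
            if d <= threshold:            # step 2: D_{i,j} <= t
                masks[j].append(i)
    return masks
```

Note that a pixel is associated with every instance that passes the threshold, so in this sketch it can appear in more than one mask, and the loop does N_S * M distance checks at most.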
