Note: this post is only meant for personal digestion and interpretation. It is incomplete and may mislead readers.
Title: PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model
George Papandreou, Tyler Zhu, Liang-Chieh Chen, Spyros Gidaris, Jonathan Tompson, Kevin Murphy
Pose
Detection

Hough voting
B is the bilinear interpolation kernel.
S_k is the short-range offset field for keypoint type k.
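A minimal NumPy sketch of the voting step, to make the roles of B and S_k concrete. It assumes each pixel x_i casts a vote at x_i + S_k(x_i), weighted by its heatmap probability p_k(x_i) and spread over the four neighbouring pixels by the bilinear kernel. The function name `hough_vote`, the (dy, dx) offset layout and the 1/(pi R^2) normalization are assumptions of this sketch, not taken from these notes.

```python
import numpy as np

def hough_vote(prob, offsets, radius):
    """Accumulate Hough votes for one keypoint type (illustrative sketch).

    prob:    (H, W) heatmap p_k(x)
    offsets: (H, W, 2) short-range offsets S_k(x), stored as (dy, dx) in pixels
    radius:  disk radius R, used only for the assumed 1 / (pi R^2) normalization
    """
    h, w = prob.shape
    score = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel votes for the position x_i + S_k(x_i).
    ty = (ys + offsets[..., 0]).ravel()
    tx = (xs + offsets[..., 1]).ravel()
    weight = prob.ravel() / (np.pi * radius ** 2)
    y0 = np.floor(ty).astype(int)
    x0 = np.floor(tx).astype(int)
    fy, fx = ty - y0, tx - x0
    # Bilinear kernel B: split each vote over the four surrounding pixels.
    corners = (
        (0, 0, (1 - fy) * (1 - fx)),
        (0, 1, (1 - fy) * fx),
        (1, 0, fy * (1 - fx)),
        (1, 1, fy * fx),
    )
    for dy, dx, wb in corners:
        yy, xx = y0 + dy, x0 + dx
        inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        # np.add.at accumulates correctly when many votes hit the same pixel.
        np.add.at(score, (yy[inside], xx[inside]), (weight * wb)[inside])
    return score
```

Votes that land outside the image are simply dropped here; that is a choice of this sketch.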
Grouping
We add to our network a separate pairwise mid-range 2-D offset field output M_{k,l}(x) designed to connect pairs of keypoints. We compute 2(K − 1) such offset fields, one for each directed edge connecting pairs (k, l) of keypoints which are adjacent to each other in a tree-structured kinematic graph of the person.
Mid-range pairwise offsets
Target:
Recurrent offset refinement
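A rough sketch of how I read the refinement step: the mid-range offset at x points to a target x' near keypoint l, and the short-range offset S_l looked up at x' is added to correct the target. The function name `refine_offsets`, the (dy, dx) layout, the default of two steps and the nearest-pixel lookup (a simplification; an interpolated lookup would be smoother) are all assumptions of this sketch.

```python
import numpy as np

def refine_offsets(mid, short, steps=2):
    """Refine mid-range offsets with short-range offsets (illustrative sketch).

    mid:   (H, W, 2) mid-range offsets M_{k,l}(x), stored as (dy, dx)
    short: (H, W, 2) short-range offsets S_l(x) of the target keypoint type l
    steps: number of refinement passes (assumed default)
    """
    h, w = mid.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    base = np.stack([ys, xs], axis=-1).astype(np.float64)
    refined = mid.astype(np.float64)
    for _ in range(steps):
        target = base + refined  # x' = x + M_{k,l}(x)
        # Nearest-pixel lookup of S_l(x'), clipped to the image bounds.
        iy = np.clip(np.rint(target[..., 0]).astype(int), 0, h - 1)
        ix = np.clip(np.rint(target[..., 1]).astype(int), 0, w - 1)
        # The new offset points from x to x' + S_l(x').
        refined = refined + short[iy, ix]
    return refined
```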
Fast greedy decoding
- create a priority queue, shared across all K keypoint types, in which we insert the position x_i and keypoint type k of all local maxima in the Hough score maps h_k(x) (these serve as candidate seeds for starting a detection instance)
- pop elements out in descending score order
- if the position x_i of the current candidate detection seed of type k is within a disk D_r(y_{j’},k) of the corresponding keypoint of previously detected person instances j’, then we reject it
- otherwise, start a new detection instance j with the k-th keypoint at position y_{j,k} = x_i serving as seed
- follow mid-range displacement vectors to connect pairs, setting y_{j,l} = y_{j,k} + M_{k,l} (y_{j,k})
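The decoding steps above can be sketched with Python's `heapq`. This is only an illustration: the function name `greedy_decode`, the seed tuple layout, and the representation of each mid-range offset field M_{k,l} as a callable from a position to an offset are hypothetical choices made to keep the example short.

```python
import heapq
import numpy as np

def greedy_decode(seeds, mid_offsets, edges, num_keypoints, nms_radius):
    """Greedy keypoint grouping (illustrative sketch).

    seeds:       list of (score, k, (y, x)) local maxima of the Hough maps
    mid_offsets: dict mapping a directed edge (k, l) to a callable pos -> offset
    edges:       directed edges (k, l) of the tree-structured kinematic graph
    """
    # heapq is a min-heap, so scores are negated to pop the best seed first.
    # The index i breaks ties and keeps the arrays from being compared.
    heap = [(-s, i, k, np.asarray(p, dtype=float))
            for i, (s, k, p) in enumerate(seeds)]
    heapq.heapify(heap)
    instances = []
    while heap:
        _, _, k, pos = heapq.heappop(heap)
        # Reject seeds inside the disk D_r around keypoint k of an earlier instance.
        if any(k in inst and np.linalg.norm(pos - inst[k]) <= nms_radius
               for inst in instances):
            continue
        inst = {k: pos}  # the seed becomes y_{j,k}
        # Walk the tree: y_{j,l} = y_{j,k} + M_{k,l}(y_{j,k}).
        while len(inst) < num_keypoints:
            grew = False
            for a, b in edges:
                if a in inst and b not in inst:
                    inst[b] = inst[a] + mid_offsets[(a, b)](inst[a])
                    grew = True
            if not grew:
                break  # remaining keypoints are unreachable from the seed
        instances.append(inst)
    return instances
```

Each returned instance is a dict from keypoint type to position, in the order the instances were started.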
Scoring
keypoint score
Expected-OKS
instance-level score
soft-NMS
We use as the final instance-level score the sum of the scores of the keypoints that have not already been claimed by higher-scoring instances.
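A small sketch of this scoring rule, assuming a keypoint counts as "claimed" when the same keypoint type of an earlier (higher-ranked) instance lies within a radius r of it. The function name `soft_nms_scores`, the array layouts and the optional `normalize` flag (divide by K) are assumptions of the sketch.

```python
import numpy as np

def soft_nms_scores(keypoints, kp_scores, radius, normalize=False):
    """Instance scores from unclaimed keypoints (illustrative sketch).

    keypoints: (M, K, 2) keypoint positions, ordered so that earlier
               instances are the higher-ranked ones
    kp_scores: (M, K) keypoint scores s_{j,k}
    radius:    a keypoint within this distance of the same keypoint type
               in an earlier instance counts as already claimed
    """
    m, k = kp_scores.shape
    out = np.zeros(m)
    for j in range(m):
        # Distances to the same keypoint type in all earlier instances: (j, K).
        d = np.linalg.norm(keypoints[:j] - keypoints[j], axis=-1)
        # With no earlier instances, np.all over an empty axis gives all True.
        unclaimed = np.all(d > radius, axis=0)
        total = kp_scores[j, unclaimed].sum()
        out[j] = total / k if normalize else total
    return out
```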
Instance-level person segmentation
Given keypoint-level person instance detections, identify pixels that belong to people (recognition) and associate them with the detected person instances (grouping)
- Semantic person segmentation in standard fully-convolutional fashion
- Associating segments with instances via geometric embeddings

Long-range offset
Refinement

Geometric embedding
Embedding distance
To decide if the person pixel x_i belongs to the j-th person instance, we compute the embedding distance metric
Set \lambda_j equal to the square root of the area of the bounding box tightly containing all detected keypoints of the j-th person instance.
With N_S person pixels and M person instances, this association has complexity O(N_S * M), instead of the O(N_S * N_S) of standard embedding-based segmentation techniques which, at least in principle, require computation of embedding vector distances for all pixel pairs.
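A hedged NumPy sketch of one way to compute such a distance matrix. It assumes the geometric embedding of a pixel is G_k(x_i) = x_i + L_k(x_i), with L_k the long-range offset, and that D_{i,j} is an average over keypoints of ||G_k(x_i) - y_{j,k}|| / \lambda_j weighted by keypoint confidence. The function name `embedding_distance`, the array layouts and the exact weighting are my reading, not confirmed by these notes.

```python
import numpy as np

def embedding_distance(pixels, long_offsets, keypoints, kp_probs):
    """Pixel-to-instance embedding distances (illustrative sketch).

    pixels:       (N, 2) person pixel positions x_i
    long_offsets: (N, K, 2) long-range offsets L_k(x_i)
    keypoints:    (M, K, 2) detected keypoint positions y_{j,k}
    kp_probs:     (M, K) keypoint confidences used as weights (assumed)
    Returns D with shape (N, M).
    """
    emb = pixels[:, None, :] + long_offsets  # G_k(x_i), shape (N, K, 2)
    # lambda_j: square root of the area of the tight keypoint bounding box.
    extent = keypoints.max(axis=1) - keypoints.min(axis=1)  # (M, 2)
    lam = np.sqrt(extent[:, 0] * extent[:, 1])  # (M,)
    # Distance of every embedding to the matching keypoint of every instance.
    dist = np.linalg.norm(emb[:, None] - keypoints[None], axis=-1)  # (N, M, K)
    w = kp_probs / kp_probs.sum(axis=1, keepdims=True)  # (M, K)
    return (dist * w[None]).sum(axis=-1) / lam[None]
```

The intermediate array has N * M * K entries, so the cost grows with the number of instances rather than with the number of pixel pairs. A degenerate bounding box (zero area) would need special handling that this sketch omits.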
Procedure
- find all positions x_i marked as person in the semantic segmentation map, i.e. those pixels that have semantic segmentation probability p_S(x_i) >= 0.5.
- associate each person pixel x_i with every detected person instance j for which the embedding distance metric satisfies D_{i,j} <= t
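The two steps above amount to a pair of thresholds. A minimal sketch, assuming the embedding distances D_{i,j} have already been computed for every pixel and instance; the function name `instance_masks` and the dense (H, W, M) layout are hypothetical.

```python
import numpy as np

def instance_masks(seg_prob, distances, t, seg_thresh=0.5):
    """Per-instance masks from segmentation and distances (illustrative sketch).

    seg_prob:  (H, W) semantic segmentation probability p_S(x)
    distances: (H, W, M) embedding distances D_{i,j} to each of M instances
    t:         distance threshold
    Returns boolean masks of shape (M, H, W).
    """
    person = seg_prob >= seg_thresh  # step 1: keep person pixels only
    close = distances <= t           # step 2: D_{i,j} <= t
    # A pixel may end up in more than one instance mask, or in none.
    return np.moveaxis(person[..., None] & close, -1, 0)
```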